Finding suitable algorithms to optimize today’s computationally demanding real‐world problems is a critical and challenging task that has attracted a great deal of effort over the last decade. For instance, Barshandeh and Haghzadeh1 proposed a novel hybrid physics‐based, nature‐inspired meta‐heuristic algorithm, which they named the proposed hybrid optimization algorithm (PHOA). They integrated atom search optimization (ASO) and the tree‐seed algorithm (TSA) to improve on traditional meta‐heuristic algorithms; PHOA was also tested on seven real‐life engineering problems, where its results were superior to those of traditional algorithms. In addition, Barshandeh et al.2 proposed a novel hybrid multipopulation algorithm (HMPA) that combined the artificial ecosystem‐based optimization (AEO) and Harris Hawks optimization (HHO) algorithms and then adopted a Levy‐flight strategy, a local search mechanism, quasi‐oppositional learning, and chaos theory to maximize the efficiency of HMPA. In their research, HMPA was tested on seven constrained/unconstrained real‐life engineering problems, and its results were compared with those of similar advanced algorithms. The results indicated that HMPA significantly outperformed the competitor algorithms. To extend the work of Barshandeh and Haghzadeh1 and Barshandeh et al.,2 it is critical to seek optimization algorithms for handling real‐life corpus analysis issues, especially during this era of information explosion.
In this modern digital era, corpus building has evolved from manual collection to automatic collection of textual data. To manage its massive textual data, corpus usually combines statistics, machine learning algorithms, or artificial intelligence (AI) techniques; this facilitates the efficiency of data collection, information processing, information retrieval (IR), and so on. Natural languages are one of the most ubiquitous formats of information flow among people. Analyzing, integrating, and reproducing textual data inevitably require importing highly accurate algorithms to process natural languages’ semantics and syntax. Corpus‐based approaches that embed statistical algorithms, such as frequency calculation and log‐likelihood test, are commonly adopted by linguists and data analysts for deciphering linguistic patterns and extracting domain knowledge.3, 4 In addition, in corpus‐based approaches, word ranking is an important technique used to define words’ importance level and to retrieve critical words from the large textual data; this especially helps discover semantic relationships between lexical units.5, 6
In the face of novel diseases, it is essential to build specialized medical corpora for integrating, managing, and retrieving the massive information related to the diseases; such corpora help to analyze, respond to, and prevent the diseases more effectively. For example, COVID‐19, a novel disease that broke out in December 2019, is caused by a virus that is genetically close to SARS coronavirus (SARS‐CoV), and it had caused over 40 million confirmed cases and 1 million deaths by the end of October 2020 (less than a year).7–12 Leading researchers from various countries are trying to unveil the mystery of the novel disease. As of the end of October 2020, Web of Science (WOS), an internationally renowned academic database, had indexed more than 35,000 COVID‐19‐related research articles (RAs), and this number keeps rising. Governments around the world are seeking direct and effective measures to mitigate the pandemic and speed up the cure of the confirmed cases.13, 14 With big textual data about COVID‐19 being rapidly distributed, it is critical for humans to rely on machine algorithms to compute important semantic information and thereby filter and retrieve critical messages.15, 16 Hence, adopting corpus‐based approaches to process and integrate COVID‐19‐related English‐mediated textual data will enhance frontline medical personnel’s efficiency of knowledge acquisition and perception.
Since the advent of computer technology, the practicality of corpus‐based approaches has received widespread attention and adoption in the fields of textual information analysis. The frequency criterion is considered one of the core analytical techniques in corpus‐based approaches. However, relying only on tokens’ frequency values to determine their importance may be insufficient; tokens’ dispersion and concentration conditions also need to be taken into consideration. For example, in terms of importance, a word occurring 100 times in one RA is not equal to a word occurring 10 times in each of 10 RAs, because the words’ dispersion and concentration conditions are different. A potential solution that adopts the Hirsch index (H‐index) algorithm to integrate and compute the criteria of dispersion and concentration is required to address this issue. The H‐index algorithm was originally used to quantify the cumulative impact and relevance of a researcher’s scientific research achievements.17–23 Nevertheless, the algorithm is not limited to evaluating academic achievements; it has also been applied in fields such as risk assessment22 and medicine.24
Handling critical word‐ranking issues using traditional frequency‐based approaches may cause distortion and bias because those approaches neither refine the corpus data nor simultaneously compute words’ frequency dispersion and concentration criteria; hence, the supposedly important high‐frequency words can be called into question. Thus, this paper proposes a novel corpus‐based approach that integrates a corpus software and the H‐index algorithm as a computation method and evaluation metric that can enhance the accuracy of word ranking, compensate for the deficiency of the traditional frequency‐based approaches, and further augment the efficacy of corpus‐based analysis. To verify the proposed approach, 100 COVID‐19‐related medical RAs from Science Citation Index (SCI) journals in WOS were retrieved and compiled as the big textual data, which served as the empirical example to which the proposed approach was applied. The main reason the researchers adopted this empirical example was that SCI journals represent high‐quality academic publications. In addition, understanding the specific linguistic pragmatics of medical RAs will assist frontline healthcare personnel in processing and acquiring important COVID‐19 medical messages.
The remainder of this paper is organized as follows: Section 2 describes preliminaries, explains the theoretical framework, and introduces the recent novel disease, COVID‐19. Section 3 describes detailed steps of the proposed approach. Section 4 uses COVID‐19‐related RAs from WOS as the big textual data (i.e., the target corpus) and as an empirical example to verify the proposed approach. Section 5 is the concluding part of this study.
2.1 Conventional frequency‐based corpus analysis
With the advance of computer technology, corpus development has enabled people to establish algorithms to integrate, manage, and process natural languages from massive textual data, thereby driving the progress of natural language processing (NLP) and AI‐related industries. O’Keeffe et al.25 noted that information on frequency counts of tokens is the basis for understanding core vocabularies that native speakers use frequently and the common combinations of vocabulary usage. Collecting large data (corpora) from native speakers’ written texts and discourse transcripts will provide strong evidence for understanding their linguistic patterns. Moreover, ranking words based on their frequency will show the words that are adopted by the majority and the words that are used in day‐to‐day communications.26, 27 Hence, frequency‐based corpus analytical approaches have widely been adopted by linguists, sociologists, text analysts, and so on for extracting strong linguistic evidence for interpreting cultural phenomenon, jargon, genre type, and so on.28, 29 For example, Le and Miller6 adopted Sketch Engine, a corpus software, to cross‐compare four medical corpus sources to extract the most frequently occurring medical morphemes in medical RAs. The resulting data indicated 136 specialized medical morphemes that account for 8.5% of the lexical items in the Medical Web Corpus, and the results offered English as a Foreign Language (EFL) medical students a useful academic resource for enhancing their comprehension of English medical vocabulary. Grabowski5 used WordSmith Tools 5.0, a corpus software, to present a corpus‐driven description of the use and functions of top‐50 keywords (i.e., based on keyness values) complemented by a similar description of top‐50 lexical bundles (LBs; based on frequency values) in the analysis of specialized corpus which contains patients’ prescriptions, outlines of product introduction, clinical trial protocols, and pharmacological RAs. 
The results provided significant pedagogical value for English for specific purposes (ESP) students and EFL practitioners in the pharmaceutical domain.
The traditional corpus‐based approach was designed to clarify, categorize, and interpret the patterns of natural languages effectively. Computing word frequency is thus a core technique of corpus software (see Equation 1).
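As a minimal sketch of the frequency criterion (not the corpus software's actual implementation), a token's overall frequency can be taken as the sum of its occurrences across all texts in the corpus. The simplistic tokenizer, the function names `tokenize` and `overall_frequency`, and the toy texts below are assumptions made only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def overall_frequency(texts):
    """Sum each token's occurrences over every text in the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Toy corpus (hypothetical sentences, not taken from the study's data).
toy_corpus = [
    "The virus spreads; the virus mutates.",
    "Masks limit the spread of the virus.",
]
freq = overall_frequency(toy_corpus)
ranked = freq.most_common()  # wordlist sorted from highest to lowest frequency
```

In this toy wordlist the function word "the" is ranked first, which is the kind of result a plain frequency ranking produces before any refinement of the corpus.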
2.2 H‐index algorithm
The H‐index algorithm was proposed in 2005 by Jorge E. Hirsch,19 a physicist and professor at the University of California, San Diego. The H‐index is an evaluation mechanism used to measure a researcher’s academic productivity and the citation rate of the researcher’s published articles: a researcher has index h if h of his or her papers have each been cited at least h times, which makes it a useful index for quantifying a researcher’s academic achievements. Nowadays, this mechanism has been widely adopted in several academic databases, such as WOS, Google Scholar, and Scopus, and even in other research fields.18, 20, 22 The algorithm computes the interrelationship between publication quantity and number of citations, and defines a researcher’s academic influence in a certain domain. For example, Li et al.22 adopted the H‐index algorithm to assess the significance of the urban railroad network structure, taking the topology, passenger quantity, and passenger flow correlation of the Beijing urban railroad network into consideration to refine the rail network structure and decrease operational risks. Gao et al.17 proposed a weighted H‐index (hw) by constructing an operator H on weighted edges; the accumulation of the weighted H‐index (sh) in a node’s neighborhood defines the node’s spreading influence. They then used the susceptible–infected–recovered (SIR) model to investigate an epidemic spreading process on 12 real‐world networks and to identify the most influential spreaders. Hanna et al.24 developed a novel metric for quantifying patient‐level utilization of emergency department (ED) imaging. In their research, the H‐index was adopted to measure a patient’s annual ED imaging volume, and the resulting patients’ H‐index values were used as referential data for mitigating imaging‐related costs and improving throughput in the ED.
In summary, the H‐index algorithm integrates multiple considerations to evaluate the importance of the research objects. Hirsch’s H‐index is defined as follows:
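Definition 2. (Hirsch) If the value of function f represents the citation times of each paper and the papers are ranked in descending order of citation times (see Equation 2), then find the largest n for which f(b_n) is equal to or larger than n (see Equation 3). The value of the H‐index has to satisfy this criterion, and can be described as follows:

$$f(b_1) \geq f(b_2) \geq \cdots \geq f(b_n) \tag{2}$$

$$H\text{-index} = \max\{\, i : f(b_i) \geq i,\ 1 \leq i \leq n \,\} \tag{3}$$

where n is the number of papers, $f(A_i)$ is the citation times of paper $A_i$, and $f(b_i)$ represents the citation times of each paper ranked from maximum to minimum.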
To understand this algorithm, two examples are given as follows:
Example 1. If a researcher has 10 published articles (n = 10) identified as A1, A2, …, A10, and the citation numbers are arbitrarily given as 9, 5, 50, 20, 6, 8, 6, 4, 1, 0, then f(A1) = 9, f(A2) = 5, f(A3) = 50, f(A4) = 20, f(A5) = 6, f(A6) = 8, f(A7) = 6, f(A8) = 4, f(A9) = 1, f(A10) = 0. Reranking the citation numbers in descending order gives f(b1) = 50, f(b2) = 20, f(b3) = 9, f(b4) = 8, f(b5) = 6, f(b6) = 6, f(b7) = 5, f(b8) = 4, f(b9) = 1, f(b10) = 0. Here b6 is the last paper that satisfies the criterion of Equation (3), because f(b6) = 6 ≥ 6 whereas f(b7) = 5 < 7; thus, H‐index = 6 (see Table 1).
Example 2. The illustrative diagram (see Figure 1) also explains the H‐index algorithm. The diagram contains a reference line, which represents the requirement that the nth paper have at least n citations; a paper’s citations have to be on or above the reference line for the paper to be counted in the H‐index. In this case, f(b6) belongs to the sixth paper, which is also the last paper on or above the reference line. Its citation count is six, which satisfies Equation (3), f(b6) ≥ 6; thus, the value of the H‐index is equal to 6.
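The two examples can be checked with a short sketch of the definition: sort the citation counts in descending order, then keep the last rank n at which the nth count is still at least n. The function name `h_index` is an assumption for illustration; the citation numbers are those of Example 1.

```python
def h_index(values):
    """Return the largest n such that the n-th largest value is at least n."""
    ranked = sorted(values, reverse=True)  # descending order, as in Equation (2)
    h = 0
    for position, value in enumerate(ranked, start=1):
        if value >= position:  # criterion f(b_n) >= n, as in Equation (3)
            h = position
        else:
            break  # every later paper is below the reference line
    return h


# Citation numbers of the ten papers A1..A10 in Example 1.
citations = [9, 5, 50, 20, 6, 8, 6, 4, 1, 0]
example_h = h_index(citations)
```

For these ten citation counts the sketch returns 6, in agreement with Example 1; the early `break` corresponds to the point at which the ranked papers fall below the reference line of Example 2.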
In summary, H‐index algorithm presents the estimation of the significance, importance, and wide influence of a researcher’s cumulative academic contributions. It has become a standard measurement and a criterion that is unbiased to compare and to evaluate the academic achievements of researchers who are competing in the same research fields.19
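|Original data||Computing process||H‐index result|
|Research paper||Citation times||Research paper||Citation times|
|A1||9||b1||50||6|
|A2||5||b2||20|
|A3||50||b3||9|
|A4||20||b4||8|
|A5||6||b5||6|
|A6||8||b6||6|
|A7||6||b7||5|
|A8||4||b8||4|
|A9||1||b9||1|
|A10||0||b10||0|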
COVID‐19 is the name that WHO gave in February 2020 to the disease caused by the novel coronavirus SARS‐CoV‐2, which had provisionally been called 2019‐nCoV. The first clusters of cases were discovered in Wuhan city, Hubei province, China.7 Epidemiologists, for now, propose the possibility that the virus, which was originally carried by wild animals, entered human‐to‐human transmission routes because locals in the city have a preference for “Yeh‐Wei,” the meat of wild animals such as bats, birds, and rodents.8, 10 Upon visiting the possible source location of COVID‐19, Huanan market, medical experts found plenty of contaminated carcasses of wild animals stocked and piled for sale. Thus, medical and biological experts speculated that the novel coronavirus may have mutated constantly in animal hosts (e.g., bats, pangolins, etc.) and then become capable of infecting humans, especially when people process animal carcasses or eat uncooked food ingredients that host the virus.8 Indeed, many studies have indicated that bats were the initial hosts of the virus because it has over 90% similarity to two SARS‐like coronaviruses from bats, bat‐SL‐CoVZX45 and bat‐SL‐CoVZX21.9, 12 In terms of etiology, the virus has a genetic form similar to SARS‐CoV (i.e., severe acute respiratory syndrome coronavirus, which broke out in 2002) and MERS‐CoV (i.e., Middle East respiratory syndrome coronavirus, which broke out in 2012),12, 32 but its spike (S) protein has mutated and enabled it to attack the host’s immune system, making the host too weak to resist the virus.33 A comparison of COVID‐19 with the two prior coronaviruses shows that COVID‐19 has a lower fatality rate but an extremely high infectious capability.34 Yi et al.12 also pointed out that the majority of the human population lacks immunity to COVID‐19 and is thus susceptible to the novel coronavirus.
Reverse transcriptase polymerase chain reaction (RT‐PCR) was initially adopted as the primary criteria for diagnosing COVID‐19. However, RT‐PCR test method has a high probability of misdiagnosis that may accelerate the pandemic, thus, multiple diagnosing test approaches were integrated with the investigations of travel history survey, disease records, clinical symptoms (see Figure 2), lab tests, and X‐ray or computed tomography (CT) for making effective diagnoses.35 Following the intensification of the COVID‐19 pandemic, rapid test toolkits were invented to rapidly detect RNA, antigen, or antibody of SARS‐CoV‐2, giving more time to frontline healthcare personnel to respond and cure the confirmed cases. In addition, prior studies pointed out that without protective measures (i.e., surgical masks, respiratory filtrations, etc.), three major transmission routes of inhalation, droplet, and contact routes will cause 57%, 35%, and 8.2% of COVID‐19 infection probability.36 For frontline healthcare personnel, in particular, who treat confirmed cases and have prolonged exposure to the virus emission environment and inhalation of droplets (<10 μm) that contain the virus, their possibility of infection may reach over 80%.37 Prior research also showed that social distance (1.5–2 m) will not be effective if the virus emission source does not wear any protective equipment because the virus can be spread at least 6 m away via patients’ coughing and sneezing.38, 39 Hence, even though the fatality rate of COVID‐19 is not extremely high, high infection rates cause difficulties in pandemic response and prevention.
According to WHO, as of October 31, 2020, there were 45,408,704 confirmed COVID‐19 cases and 1,179,363 COVID‐19 deaths (see Figure 3). Because targeted therapeutic medicines are still being developed, governments can presently rely only on quarantine policies and existing indirect medical treatments; they therefore urge citizens to pay attention to personal hygiene, implement border control measures, and encourage social distancing, internet shopping, and so on to decrease close contact between people and control the COVID‐19 pandemic.40–42
COVID‐19, at the time of this writing, is still a semi‐unknown novel disease for medical experts and continues to be explored. To effectively manage the massive medical textual information about it, it is necessary to create a COVID‐19‐specialized corpus, integrating appropriate algorithms for information processing and mining.
Traditional corpus‐based computing methods for critical word ranking mainly calculate words’ frequency values and rank them. Prior studies believed high‐frequency words may reflect specific linguistic patterns in certain domains which would benefit EFL speakers in more effective acquisition of domain knowledge when reading English texts.3, 5, 6, 43, 44 Thus, with rapid information flow of COVID‐19, establishing COVID‐19 specialized corpus for timely acquisition of updated medical knowledge is especially critical for medical care personnel.7, 9, 11, 14, 32 Certainly, as of the end of October 2020, more than 38,000 RAs on COVID‐19‐related topics had been published in the WOS database; this phenomenon indicated that a large number of research results were produced by leading researchers globally. To effectively integrate and decipher the English‐mediated professional textual information and to further improve the efficiency of knowledge acquisition, importing algorithms to compute key natural language semantics is quite critical. Corpus‐based and NLP technology hence plays the essential roles at this time for humans to efficiently process the big textual information available.25, 45
However, existing corpus software, such as AntConc 3.5.8 (Ref. 30) and WordSmith Tools 5.0, still cannot compute tokens’ dispersion and concentration conditions simultaneously with its built‐in algorithms. Its word‐ranking results can be based only on either frequency values or range values, which biases the evaluation of words’ importance level. Therefore, to compensate for this bias in the word‐ranking results of the traditional methods, the researchers propose a novel corpus‐based approach that integrates AntConc 3.5.8 (Ref. 30) and the H‐index algorithm19 to compute and evaluate the importance of tokens.
The steps are as follows. In the initial stage of the proposed approach, sample and compile the textual data as the target corpus in a way that is suitable for the H‐index algorithm. Then, adopt Chen et al.’s46 corpus‐based optimizing approach to refine the target corpus. In the middle stage, use AntConc 3.5.8 (Ref. 30) to compute tokens’ frequency values and ranges, and then adopt the H‐index algorithm to compute tokens’ dispersion and concentration conditions together and obtain their H‐index values. Next, rank the tokens based on their H‐index and frequency values. The postranking results shed light on the importance of the proposed approach and suggest possible future applications in corpus‐based and NLP fields. The proposed approach has six steps in total, which are described in detail as follows (see Figure 4):
Step 1. Compiling suitable categorization of the big textual data for H‐index analysis.
H‐index algorithm is mainly used to explore the citation rate of research papers. In this study, the authors adopt it to explore the usage rate of tokens. In this step, the target corpus (i.e., the big textual data) should be segmented into its basic elements that consider an article as a unit instead of compiling all files into a big file (see Figure 5). Hence, the H‐index of tokens will be computed successfully.
Step 2. Extracting tokens from the big textual data.
Using AntConc 3.5.8 as the corpus software to calculate and unveil the composition of the big textual data, the quantitative data will be retrieved and all tokens will be labeled with numbers in this step.
Step 3. Optimizing the big textual data.
Function words and meaningless words decrease the efficiency of corpus‐based approaches; hence, a refining process is necessary to retrieve the substantive words that best reflect domain information. In this step, a function wordlist and a machine optimizing process are adopted to refine the big textual data;46 the remaining content words are processed in the subsequent steps.
Step 4. Ranking tokens based on individual overall frequency criteria.
After calculating each token’s overall frequency based on Equation (1) by the corpus software, the wordlist in this step will be ranked based on frequency criteria, from highest to lowest frequency sequences.
Step 5. Ranking tokens based on H‐index algorithm.
In this step, the researchers adopt the H‐index algorithm to compute the significance of tokens. Here, the citation times are considered as the tokens’ adoption times (i.e., frequency), thus, the calculation of tokens’ H‐index is based on a token appearing equal to or more than n times in n RAs. First, based on Equation (2), rank the word frequency of each RA in descending order. Then, based on Equation (3), find a word’s H‐index value that satisfies the criteria.
Step 6. Integrating tokens’ ranking information for future extended applications.
Ranking tokens based on their H‐index values in descending order.
If tokens have the same H‐index values, then rank their frequency values in descending order.
The proposed approach uses H‐index algorithm to compute a token’s degree of importance, simultaneously taking the criteria of dispersion and concentration into consideration. In addition, when facing the same H‐index values, use tokens’ frequency values to define their ranks to avoid hesitation that occurs when defining tokens’ degree of importance.
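Steps 4 to 6 can be illustrated with a minimal sketch, assuming that each RA has already been reduced to a table of token counts (in the study this information comes from the corpus software; here it is a hypothetical in‐memory dictionary). The function names, the article identifiers, and the two toy tokens are assumptions for illustration only. The toy data reproduce the contrast drawn in the introduction: one token occurs 100 times in a single RA, the other 10 times in each of 10 RAs.

```python
from collections import Counter


def h_index(values):
    """Largest n such that at least n of the values are >= n."""
    ranked = sorted(values, reverse=True)
    h = 0
    for position, value in enumerate(ranked, start=1):
        if value >= position:
            h = position
        else:
            break
    return h


def rank_tokens(per_article_counts):
    """Rank tokens by H-index (descending), breaking ties by overall frequency.

    per_article_counts maps an article identifier to a Counter of token counts.
    Returns a list of (token, h_index, overall_frequency) tuples.
    """
    tokens = set()
    for counts in per_article_counts.values():
        tokens.update(counts)
    rows = []
    for token in tokens:
        # Frequency of the token in each RA that uses it (Counter returns 0 if absent).
        freqs = [c[token] for c in per_article_counts.values() if c[token] > 0]
        rows.append((token, h_index(freqs), sum(freqs)))
    # Step 6: H-index first, then frequency; the token itself only makes the order deterministic.
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


# Hypothetical corpus: "dispersed" occurs 10 times in each of 10 RAs,
# "concentrated" occurs 100 times in a single RA.
articles = {f"RA-{i:02d}": Counter({"dispersed": 10}) for i in range(1, 11)}
articles["RA-01"]["concentrated"] = 100
ranking = rank_tokens(articles)
```

Both toy tokens have an overall frequency of 100, so a frequency‐only wordlist cannot separate them, whereas their H‐index values (10 versus 1) place the widely dispersed token first.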
4 EMPIRICAL STUDY
4.1 Overview of the compiled big textual data
The big textual data in this paper are 100 RAs collected from WOS. WOS was chosen because it is one of the largest and best‐known leading databases in the world. Moreover, many studies on academic big textual data analysis and NLP in scientific fields have adopted RAs from WOS as test data.47–49 Hence, in this study, the researchers chose Medicine, General, and Internal, a category defined by Journal Citation Reports (JCR) for WOS, and then focused on its open access (OA) journals (N = 24). To process these 24 journals, the authors first calculated their respective annual publications (data retrieved from 2019.9.1 to 2020.8.31) and then calculated the number of papers that were related to the COVID‐19 topic. Finally, they sampled the newest articles from each journal in proportion to these numbers and compiled the big textual data (see Table 2). The research fields of the sampled journals comprise (1) environmental sciences, (2) public, environmental, and occupational health, (3) infectious diseases, (4) tropical medicine, (5) microbiology, (6) toxicology, (7) healthcare sciences and services, and (8) health policy and services. Furthermore, the collected RAs all had COVID‐19 in their titles, and they discussed problems and solutions during the COVID‐19 pandemic in line with their research fields. The paper collecting method in this study attempted to balance domain and genre type as far as possible so that native and EFL healthcare personnel can understand the most important and widely used tokens in medical RAs.
|Topic||Category||Journal||Annual publication||COVID‐19‐related RAs||Actual collected articles|
|COVID‐19||Medicine, General, and Internal||International Journal of Environmental Research and Public Health||7683||253||41|
|Frontiers in Public Health||539||94||15|
|Journal of Global Health||228||45||7|
|Lancet Global Health||399||43||7|
|Lancet Public Health||173||41||7|
|Journal of Infection and Public Health||252||27||4|
|Asian Pacific Journal of Tropical Medicine||102||22||4|
|BMJ Global Health||327||13||2|
|Annals of Global Health||97||13||2|
|Globalization and Health||108||12||2|
|Journal of Nepal Medical Association||172||11||2|
|BMC Public Health||1817||8||1|
|Journal of Epidemiology||79||5||1|
|Antimicrobial Resistance and Infection Control||195||5||1|
|Australian and New Zealand Journal of Public Health||114||5||1|
|Archives of Public Health||91||4||1|
|Environmental Health Perspectives||175||3||1|
|Conflict and Health||79||2||0|
|Tobacco Induced Diseases||65||2||0|
|Environmental Health and Preventive Medicine||70||1||0|
|Safety and Health at Work||68||1||0|
- Abbreviation: RA, research article.
4.2 Traditional corpus‐based computing method for handling critical word‐ranking issues
AntConc 3.5.8 (Ref. 30) works like other corpus software: based on Equation (1), it sums each word’s occurrences (i.e., frequency values) in the corpus and ranks the words accordingly. Using the compiled corpus as an example, the traditional method for handling critical word‐ranking issues causes the following problems: (1) function and meaningless words are not eliminated, so content words are pushed down the ranking, which decreases analytical efficiency; (2) the dispersion condition of frequency is not taken into consideration; and (3) the concentration condition of frequency is not taken into consideration. The word‐ranking results in Figure 6 show a wordlist that is ranked in descending order of the words’ overall frequency values.
4.3 The proposed approach
In this section, the compiled big textual data are embedded into the proposed novel corpus‐based approach for calculating the actual results of the proposed approach. A detailed description is shown as follows:
Step 1. Compiling suitable categorization of the big textual data for H‐index analysis.
To effectively compute the H‐index value of each token, the corpus should be composed with each article as a unit. To manage the big textual data, the researchers first gave each journal a codename; for example, Annals of Global Health was coded as AGH. Coding the journal names makes it possible to retrieve the sources of tokens rapidly and effectively, which increases the efficiency of text analysis and mining. Second, the file name of each article was assigned according to a specific rule. For instance, in “01. AGH‐01,” the leading 01 is the RA’s serial number within the entire big textual data, AGH is the journal codename, and the trailing 01 is the RA’s serial number within the current journal (see Table 3).
|Journal name||Codename||Data management of RAs|
|Annals of Global Health||AGH||01. AGH‐01, 02. AGH‐02|
|Australian and New Zealand Journal of Public Health||ANZJPH||03. ANZJPH‐01|
|Archives of Public Health||APH||04. APH‐01|
|Asian Pacific Journal of Tropical Medicine||APJTM||05. APJTM‐01, 06. APJTM‐02, 07. APJTM‐03, 08. APJTM‐04|
|Antimicrobial Resistance and Infection Control||ARIC||09. ARIC‐01|
|BMC Public Health||BMCPH||10. BMCPH‐01|
|BMJ Global Health||BMJGH||11. BMJGH‐01, 12. BMJGH‐02|
|Environmental Health Perspectives||EHP||13. EHP‐01|
|Frontiers in Public Health||FPH||14. FPH‐01, 15. FPH‐02, 16. FPH‐03, 17. FPH‐04, 18. FPH‐05, 19. FPH‐06, 20. FPH‐07, 21. FPH‐08, 22. FPH‐09, 23. FPH‐10, 24. FPH‐11, 25. FPH‐12, 26. FPH‐13, 27. FPH‐14, 28. FPH‐15|
|Globalization and Health||GAH||29. GAH‐01, 30. GAH‐02|
|International Journal of Environmental Research and Public Health||IJERPH||31. IJERPH‐01, 32. IJERPH‐02, 33. IJERPH‐03, 34. IJERPH‐04, 35. IJERPH‐05, 36. IJERPH‐06, 37. IJERPH‐07, 38. IJERPH‐08, 39. IJERPH‐09, 40. IJERPH‐10, 41. IJERPH‐11, 42. IJERPH‐12, 43. IJERPH‐13, 44. IJERPH‐14, 45. IJERPH‐15, 46. IJERPH‐16, 47. IJERPH‐17, 48. IJERPH‐18, 49. IJERPH‐19, 50. IJERPH‐20, 51. IJERPH‐21, 52. IJERPH‐22, 53. IJERPH‐23, 54. IJERPH‐24, 55. IJERPH‐25, 56. IJERPH‐26, 57. IJERPH‐27, 58. IJERPH‐28, 59. IJERPH‐29, 60. IJERPH‐30, 61. IJERPH‐31, 62. IJERPH‐32, 63. IJERPH‐33, 64. IJERPH‐34, 65. IJERPH‐35, 66. IJERPH‐36, 67. IJERPH‐37, 68. IJERPH‐38, 69. IJERPH‐39, 70. IJERPH‐40, 71. IJERPH‐41|
|Journal of Global Health||JGH||72. JGH‐01, 73. JGH‐02, 74. JGH‐03, 75. JGH‐04, 76. JGH‐05, 77. JGH‐06, 78. JGH‐07|
|Journal of Infection and Public Health||JIPH||79. JIPH‐01, 80. JIPH‐02, 81. JIPH‐03, 82. JIPH‐04|
|Journal of Nepal Medical Association||JNMA||83. JNMA‐01, 84. JNMA‐02|
|Journal of Epidemiology||JOE||85. JOE‐01|
|Lancet Global Health||LGH||86. LGH‐01, 87. LGH‐02, 88. LGH‐03, 89. LGH‐04, 90. LGH‐05, 91. LGH‐06, 92. LGH‐07|
|Lancet Public Health||LPH||93. LPH‐01, 94. LPH‐02, 95. LPH‐03, 96. LPH‐04, 97. LPH‐05, 98. LPH‐06, 99. LPH‐07|
|Reproductive Health||RH||100. RH‐01|
- Abbreviation: RA, research article.
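The naming rule of Table 3 can be sketched as follows, assuming that the journals are processed in the order of the table and that an ASCII hyphen stands in for the hyphen used in the labels. The function name `build_file_labels` is hypothetical; the codenames and the number of RAs per journal are read off Table 3.

```python
def build_file_labels(journal_counts):
    """Build labels such as '01. AGH-01' from (codename, number of RAs) pairs.

    The leading number is the RA's serial number in the whole corpus; the
    trailing number is its serial number within the journal.
    """
    labels = []
    serial = 0
    for codename, n_articles in journal_counts:
        for local in range(1, n_articles + 1):
            serial += 1
            labels.append(f"{serial:02d}. {codename}-{local:02d}")
    return labels


# Codenames and RA counts as listed in Table 3 (in table order).
journal_counts = [
    ("AGH", 2), ("ANZJPH", 1), ("APH", 1), ("APJTM", 4), ("ARIC", 1),
    ("BMCPH", 1), ("BMJGH", 2), ("EHP", 1), ("FPH", 15), ("GAH", 2),
    ("IJERPH", 41), ("JGH", 7), ("JIPH", 4), ("JNMA", 2), ("JOE", 1),
    ("LGH", 7), ("LPH", 7), ("RH", 1),
]
labels = build_file_labels(journal_counts)
```

With these counts the sketch yields 100 labels, from "01. AGH-01" to "100. RH-01", so every token occurrence can be traced back to one RA and one journal.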
Step 2. Extracting tokens from the big textual data.
Data management of the first step indicated that the principle of coding provides huge convenience when launching AntConc 3.5.8 to process corpus data. The corpus software analyzed all RAs’ word types, tokens, and lexical diversity (i.e., types and tokens ratio, TTR; see Table 4). The lexical results of the compiled big textual data indicated that authors from 100 RAs adopted 13,062 word types, and the whole corpus is composed of 366,866 running words. Furthermore, its TTR is approximately equal to 0.0356 (also see Table 4).
|Compiled big textual data||Word types||Tokens||TTR|
|Data codename||Numbers of paper|
- Abbreviation: TTR, types and tokens ratio.
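The lexical diversity reported in this step is the number of word types divided by the number of tokens. The following minimal sketch computes it for any token list and recomputes the reported figure from the word types and tokens stated in the text; the function name `type_token_ratio` and the four‐token toy list are assumptions for illustration.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of running tokens."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy check: three word types among four tokens.
toy_ttr = type_token_ratio(["virus", "mask", "virus", "vaccine"])

# Figures reported for the compiled corpus: 13,062 word types and 366,866 tokens.
reported_ttr = 13_062 / 366_866
```

Rounded to four decimal places, `reported_ttr` is 0.0356, which matches the TTR given in the text.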
Step 3. Optimizing the big textual data.
On the basis of Chen et al.’s46 research, function words, such as a, an, the, it, is, and so on, would decrease the efficiency of text mining and IR. Indeed, no matter which algorithm is used to calculate the importance of tokens, the irreplaceability of function words in constructing meaningful sentences will cause them to appear in resulting data or even be ranked very high, which directly decreases the accuracy and efficiency of information processing. Thus, the researchers adopted Chen et al.’s46 big textual data refining approach to optimize the compiled big textual data; the refined wordlist on the corpus software shows that meaningful words are ranked to the front (see Figure 7). In addition, the data discrepancy showed that word types of refined data decreased by 238 words (i.e., function words), nevertheless, tokens of refined data decreased 157,911 words, which caused a 43% downsizing in the corpus. Moreover, the lexical diversity was enhanced to 0.0614 (see Table 5). Unexpectedly, when facing highly specialized medical RAs, function words also occupied more than 40% of the corpus. To avoid information distortion, the eliminating procedure for function words is inevitable.
|Lexical feature||Original data||Refined data||Data discrepancy|
|Word types||13,062||12,824||−238 (−1.8%)|
|Tokens||366,866||208,955||−157,911 (−43.0%)|
|TTR||0.0356||0.0614||+0.0258|
- Abbreviation: TTR, types and tokens ratio.
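The refining step amounts to filtering tokens against a function wordlist. The sketch below is a simplified stand‐in for the refining approach cited in this step, not a reproduction of it: the six‐word `FUNCTION_WORDS` set is a tiny illustrative subset (the study removes 238 function word types), and the sample sentence is invented. The second half recomputes the refined figures from the numbers reported in the text.

```python
# Tiny illustrative subset; the function wordlist used in the study is far larger.
FUNCTION_WORDS = {"a", "an", "and", "the", "it", "is"}


def remove_function_words(tokens, function_words=FUNCTION_WORDS):
    """Keep only the tokens that are not in the function wordlist."""
    return [t for t in tokens if t.lower() not in function_words]


tokens = "the virus is a novel virus and it spreads".split()
content = remove_function_words(tokens)

# Arithmetic check against the figures reported in the text.
original_types, original_tokens = 13_062, 366_866
removed_types, removed_tokens = 238, 157_911
refined_types = original_types - removed_types
refined_tokens = original_tokens - removed_tokens
refined_ttr = refined_types / refined_tokens
shrink = removed_tokens / original_tokens  # share of the corpus made up of function words
```

The recomputed values are 12,824 word types, 208,955 tokens, a TTR of about 0.0614, and a reduction of about 43% in corpus size, consistent with the figures reported in this step.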
Step 4. Ranking tokens based on individual overall frequency criteria.
After optimizing the compiled big textual data, the authors adopted the refined traditional corpus‐based computing method30 to compute the sum of frequency values of each token (see Figure 7), and to find out each token’s frequency values in each RA by the Concordance Plot function of the corpus software. In the Concordance Plot, Concordance Hit represents a token’s overall frequency values, and Total Plot (with hits) represents how many RAs adopted a token. Take COVID as an example, its Concordance Hit is 3520 (i.e., overall frequency values) and Total Plot (with hits) is 100 which means COVID was adopted by 100 RA authors (see Figure 8). Hence, in this step, the authors obtained three important factors which include overall frequency values, frequency values in each RA, and how many RAs adopted a token. These factors are critical and will be calculated by the H‐index algorithm in the following step.
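The three factors obtained in this step can be sketched from per‐RA token counts. This is only an illustration of the quantities involved, not the corpus software's interface: the function name `concordance_summary`, the dictionary keys, and the toy counts are assumptions (the keys merely echo the Concordance Hit and Total Plot (with hits) figures described above).

```python
from collections import Counter


def concordance_summary(token, per_article_counts):
    """Summarize one token: overall frequency, per-RA frequencies, and number of RAs using it."""
    per_article = {
        article: counts[token]
        for article, counts in per_article_counts.items()
        if counts[token] > 0
    }
    return {
        "hits": sum(per_article.values()),    # overall frequency
        "plots_with_hits": len(per_article),  # how many RAs adopt the token
        "per_article": per_article,           # frequency in each RA, the input for the H-index
    }


# Hypothetical counts for three RAs (not the study's data).
toy_counts = {
    "01. AGH-01": Counter({"covid": 30, "mortality": 2}),
    "02. AGH-02": Counter({"covid": 12}),
    "03. ANZJPH-01": Counter({"covid": 5, "mortality": 9}),
}
summary = concordance_summary("mortality", toy_counts)
```

The `per_article` values are the ones that are sorted in descending order in the next step, while `hits` serves as the tie‐breaker when two tokens share the same H‐index.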
Step 5. Ranking tokens based on H‐index algorithm.
In this step, the researchers used the wordlist to select the tokens (N = 420) that had frequency values over 100 and computed their H‐index values. Taking mortality as an example, the authors recorded the frequency value of mortality in each RA as the original data and sorted these values from highest to lowest. It was then found that the ninth‐highest frequency value was no lower than 9, whereas the tenth‐highest was lower than 10; this satisfied the criteria of Equation (3), and thus the H‐index value was 9 (see Table 6). This computing approach captures how widely a token is adopted across RAs and thus evaluates its importance level more accurately. The authors then recorded the tokens’ H‐index values in Excel for the ranking process.
|Token||Original data||Computing process||H‐index result|
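The computing process in Step 5 can be sketched in Python as follows, assuming the standard H‐index definition: the largest h such that at least h of the per‐RA frequency values are each no smaller than h. The function name `h_index` and the sample frequencies are illustrative; the sample values are invented and are not those recorded in Table 6.

```python
def h_index(frequencies):
    """Largest h such that at least h of the values are each >= h."""
    ranked = sorted(frequencies, reverse=True)  # highest frequency first
    h = 0
    for position, value in enumerate(ranked, start=1):
        if value >= position:
            h = position
        else:
            break  # values only get smaller, so no later position can qualify
    return h


# Invented per-RA frequencies for one token (not the values in Table 6).
per_ra_frequencies = [9, 25, 10, 3, 18, 8, 12, 15, 11, 10, 9]
print(h_index(per_ra_frequencies))  # 9
```

After sorting, the ninth value (9) is no smaller than 9 while the tenth value (8) is smaller than 10, so the sketch returns 9.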
After the H‐index values were used to rank the tokens, the order of the wordlist changed significantly, because the H‐index takes into account how the RA authors adopted each token across RAs and thereby reinterprets the tokens’ importance. However, many tokens shared the same H‐index value. Whenever tokens had the same H‐index value, the authors sorted them by their frequency values. That is, this paper considers H‐index and frequency values together to make the calculation of tokens’ importance more accurate.
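The two‐level ranking rule (H‐index first, overall frequency as the tie‐breaker) can be expressed as a single sort key. The sketch below is illustrative only: the function `rank_tokens`, the placeholder token names, and their values are invented, and the final alphabetical tie‐break is an added assumption that the source does not describe.

```python
def rank_tokens(stats):
    """Rank tokens by H-index, breaking ties by overall frequency.

    `stats` maps token -> (h_index, overall_frequency). Rank 1 is the most
    important token. Any remaining ties are ordered alphabetically, which is
    an assumption added here only to make the output deterministic.
    """
    ordered = sorted(stats, key=lambda tok: (-stats[tok][0], -stats[tok][1], tok))
    return {tok: rank for rank, tok in enumerate(ordered, start=1)}


# Invented values for illustration only.
stats = {
    "alpha": (9, 400),
    "beta": (9, 650),
    "gamma": (12, 300),
    "delta": (1, 153),
}
print(rank_tokens(stats))
```

Here gamma ranks first despite having the lowest frequency of the top three, and beta precedes alpha because the two share an H‐index value and beta has the higher frequency.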
Step 6. Integrating tokens’ ranking information for future extended applications.
The wordlist from Step 5 combines each token’s H‐index and frequency values. The ranking produced by the proposed approach redefines the tokens’ importance levels; hence, these data provide important reference indicators for future applications, such as IR, NLP, big data analysis, machine learning, and deep learning. In this study, the authors propose a novel corpus‐based approach that integrates corpus software and the H‐index algorithm to determine which tokens are important in medical RAs. The resulting data will improve the efficiency with which native and EFL medical researchers learn and process medical RAs.
4.4 Comparison and discussion
Refining corpus data
According to Table 8, the raw data contain many function words and meaningless tokens, such as the, of, and, to, and in. The traditional frequency‐based approach30 calculated the frequency values of all tokens, so it was unable to identify which tokens carry more substantial meaning for humans. To enable corpus‐based approaches to rank critical words with substantial meanings, the refined traditional frequency‐based approach46 and the proposed approach eliminate function and meaningless words. Hence, as Table 8 shows, the refined data consist of content words that serve general or domain‐oriented purposes. This makes the corpus analysis results more meaningful and makes the retrieval of critical words more efficient.
Calculating frequency dispersion criteria
|Raw data||Refined data|
|The traditional frequency‐based approach30||The refined traditional frequency‐based approach46||The proposed approach|
The authors adopted the proposed approach to compute the top 420 tokens of the refined wordlist, each of which had a frequency value of more than 100. According to Table 8, there were significant differences in token ranking between the traditional corpus‐based computing approaches30, 46 and the proposed approach. The traditional corpus‐based computing approaches30, 46 only calculated a token’s total frequency value to define its rank and importance and did not take frequency dispersion into consideration; that is, a token with a high frequency may not be widely used by the RA authors and may be concentrated in very few RAs, or even occur in only one RA. In contrast, the proposed approach not only used the H‐index to account for the dispersion and concentration of frequency simultaneously, but also used frequency values to distinguish tokens that had the same H‐index value. Therefore, because it takes all of these criteria into consideration, the proposed approach is more rigorous and accurate. Interestingly, tokens such as COVID, health, study, pandemic, reported, infection, population, participants, and case kept the same ranks under the refined traditional frequency‐based approach and the proposed approach; that is, both their frequency and their H‐index values were extremely high, and hence the importance of those tokens was unquestionable.
The calculation results of the proposed approach redefine the importance of the tokens (N = 420) relative to the traditional corpus‐based computing approaches.30, 46 Specifically, the authors found that only 11 tokens (2.6%) remained at their original ranks, nine of which (2.1%) were in the top 50 of the wordlist (see Table 8); 15 tokens (3.6%) moved forward by more than 100 ranks; 196 tokens (46.7%) moved forward by 1 to 99 ranks; 14 tokens (3.3%) moved backward by more than 100 ranks; and 184 tokens (43.8%) moved backward by 1 to 99 ranks. In other words, by adopting the H‐index algorithm, which takes the dispersion and concentration of frequency into consideration simultaneously, the proposed approach re‐evaluated the importance of the tokens and changed the ranks of more than 97% of them (see Table 9).
|Data discrepancy||Token numbers||Proportion|
|Tokens stay at the original ranks||11||0.0262|
|Tokens move forward more than 100 ranks||15||0.0357|
|Tokens move forward from 1 to 99 ranks||196||0.4667|
|Tokens move backward more than 100 ranks||14||0.0333|
|Tokens move backward from 1 to 99 ranks||184||0.4381|
|Tokens’ H‐index value equal to 1||2||0.0048|
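The rank‐movement categories in Table 9 can be derived from two rankings with a short Python sketch. The helper names, the category labels, and the toy ranks are hypothetical. The source describes the buckets as “more than 100” and “from 1 to 99” ranks, so the sketch assumes that a shift of exactly 100 belongs to the larger bucket. The final lines recompute the Table 9 proportions from the reported token counts.

```python
from collections import Counter


def classify_shift(old_rank, new_rank, threshold=100):
    """Label how a token moved between two rankings (rank 1 = most important)."""
    shift = old_rank - new_rank  # positive means the token moved forward
    if shift == 0:
        return "stay"
    direction = "forward" if shift > 0 else "backward"
    # Assumption: a shift of exactly `threshold` counts as a large move.
    size = "large" if abs(shift) >= threshold else "small"
    return f"{direction}_{size}"


def shift_summary(old_ranks, new_ranks):
    """Count tokens per movement category and give each category's proportion."""
    labels = Counter(classify_shift(old_ranks[t], new_ranks[t]) for t in old_ranks)
    total = len(old_ranks)
    return {label: (n, round(n / total, 4)) for label, n in labels.items()}


# Toy rankings for four placeholder tokens (invented for illustration).
old_ranks = {"a": 1, "b": 2, "c": 3, "d": 150}
new_ranks = {"a": 1, "b": 3, "c": 2, "d": 4}
print(shift_summary(old_ranks, new_ranks))

# Token counts reported in Table 9; the proportions are recomputed from them.
table9 = {"stay": 11, "forward_large": 15, "forward_small": 196,
          "backward_large": 14, "backward_small": 184}
total = sum(table9.values())
proportions = {k: round(v / total, 4) for k, v in table9.items()}
print(total, proportions)
```

The five movement categories sum to 420 tokens, and the recomputed proportions agree with the values listed in Table 9.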
The proposed approach can also handle tokens’ frequency concentration criteria. For example, under the traditional corpus‐based computing approaches,30, 46 hyponatremia was ranked 231st (frequency = 153) and tobacco was ranked 391st (frequency = 104). After computation by the proposed approach, however, both words had an H‐index value of 1 (see Table 9); hence, they moved backward by 188 and 29 ranks to ranks 419 and 420, respectively (i.e., they became the two least important words among the 420 tokens). Even though hyponatremia and tobacco each occurred more than 100 times in the compiled big textual data, each was adopted by only one RA. In other words, their importance was almost negligible, because there is an extremely low probability that people will encounter these two words in future COVID‐19‐related RAs. Therefore, the traditional corpus‐based computing approaches30, 46 overestimated the importance level of these tokens.
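The concentration effect can be checked with a small worked example in Python, again assuming the standard H‐index definition. The single‐RA frequencies for hyponatremia and tobacco come from the text; the dispersed distribution is invented for contrast and has the same overall frequency of 153.

```python
def h_index(frequencies):
    """Largest h such that at least h values are each >= h (0 if there is none)."""
    ranked = sorted(frequencies, reverse=True)
    return max((i for i, v in enumerate(ranked, start=1) if v >= i), default=0)


# Reported case: each word occurs in a single RA only,
# 153 times for "hyponatremia" and 104 times for "tobacco".
hyponatremia_h = h_index([153])
tobacco_h = h_index([104])

# Invented contrast: the same 153 occurrences spread over 11 RAs.
dispersed = [20, 18, 17, 16, 15, 14, 13, 12, 11, 10, 7]
dispersed_h = h_index(dispersed)

# Rank movement reported in the text: new rank minus old rank.
moves = {"hyponatremia": 419 - 231, "tobacco": 420 - 391}
print(hyponatremia_h, tobacco_h, sum(dispersed), dispersed_h, moves)
```

With identical overall frequency, the concentrated token receives an H‐index of 1 and the dispersed one an H‐index of 10, which is the distinction that a frequency‐only ranking cannot make.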
To conclude this section, the computation of tokens’ importance levels affects the analysis and development of big data management and processing, search engines, and other related AI industries. If frequency value is the only criterion for ranking tokens’ importance levels, the assessment of their importance will be inaccurate and distorted. Hence, this paper proposed a novel corpus‐based approach that integrates corpus software and the H‐index algorithm to take tokens’ frequency dispersion and concentration criteria into consideration simultaneously, and thus handles the token‐ranking issue accurately and comprehensively.
Traditional corpus‐based computing methods still leave some analytical issues unresolved during corpus processing, for example, refining corpus data, computing frequency dispersion criteria, and computing frequency concentration criteria. These issues may decrease the efficiency of corpus data processing and, more seriously, may bias the evaluation of tokens’ importance levels, because frequency value is the only indicator used for handling word‐ranking issues in traditional corpus‐based computing methods. Thus, to compensate for the blind spots of the traditional methods, this paper proposed a novel corpus‐based approach that integrates corpus software and the H‐index algorithm to refine corpus data, to calculate tokens’ frequency dispersion and concentration criteria, and further to handle word‐ranking issues.
The significant contributions of the proposed approach are as follows: (1) the proposed approach is able to refine corpus data via machine processing to eliminate function and meaningless words; (2) the proposed approach is able to compute tokens’ frequency dispersion criteria and, when tokens have the same H‐index value, uses their frequency values as the second ranking criterion, which makes the word‐ranking process more accurate and avoids unresolved ties; and (3) the proposed approach is able to compute tokens’ frequency concentration criteria, for example, in cases where a token has a high frequency value but is overconcentrated in certain RAs; in such cases, an H‐index value of 1 indicates the token’s importance level precisely, whereas the frequency value overestimates it and distorts the ranking results. Furthermore, in relation to textual analysis of COVID‐19‐related RAs, the proposed approach also helps native and EFL frontline healthcare personnel to integrate and retrieve professional medical knowledge, and to further enhance their information processing efficiency.
This paper has a major limitation that awaits future research: without the assistance of existing software, the H‐index computing process still relies on manual processing, and once the data become too large, this places a great burden on data analysts. Hence, in terms of future perspectives, this paper suggests that future corpus‐based and NLP research incorporate the H‐index algorithm into corpus programs (i.e., software) for processing big textual data. This would enhance accuracy and efficiency in handling word‐ranking issues and aid the accurate retrieval of critical words from big textual data.
The authors would like to thank the Ministry of Science and Technology, Taiwan, for financially supporting this study under Contract Nos. MOST 108‐2410‐H‐145‐001 and MOST 109‐2410‐H‐145‐002.
CONFLICT OF INTERESTS
The authors declare that there is no conflict of interests.
- 1. A new hybrid chaotic atom search optimization based on tree‐seed algorithm and Levy flight for solving optimization problems. Eng Comput. 2020. https://doi.org/10.1007/s00366-020-00994-0
- 2. HMPA: an innovative hybrid multi‐population algorithm based on artificial ecosystem‐based and Harris Hawks optimization algorithms for engineering problems. Eng Comput. 2020. https://doi.org/10.1007/s00366-020-01120-w
- 3. A corpus‐based approach to online materials development for writing research articles. Engl Specif Purp. 2011; 30(3): 222‐234.
- 4. Peer‐to‐peer prescriptions in medical sciences: Iranian field specialists’ attitudes toward convenience editing. Engl Specif Purp. 2017; 45: 86‐97.
- 5. Keywords and lexical bundles within English pharmaceutical discourse: a corpus‐driven description. Engl Specif Purp. 2015; 38: 23‐33.
- 6. A corpus‐based list of commonly used English medical morphemes for students learning English for specific purposes. Engl Specif Purp. 2020; 58: 102‐121.
- 7. The species severe acute respiratory syndrome‐related coronavirus: classifying 2019‐nCoV and naming it SARS‐CoV‐2. Nat Microbiol. 2020; 5(4): 536‐544.
- 8. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med. 2020; 382(18): 1708‐1720.
- 9. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet. 2020; 395(10224): 565‐574.
- 10. A novel coronavirus outbreak of global health concern. Lancet. 2020; 395(10223): 470‐473.
- 11. Virological assessment of hospitalized patients with COVID‐2019. Nature. 2020; 581(7809): 465‐469.
- 12. COVID‐19: what has been learned and to be learned about the novel coronavirus disease. Int J Biol Sci. 2020; 16(10): 1753‐1766.
- 13. Early treatment of COVID‐19 disease: a missed opportunity. Infect Dis Ther. 2020; 9(4): 715‐720.
- 14. Covid‐19: Americans afraid to seek treatment because of the steep cost of their high deductible insurance plans. BMJ‐Br Med J. 2020; 371:m3860.
- 15. An overview of literature on COVID‐19, MERS and SARS: using text mining and latent Dirichlet allocation. J Inf Sci. 2020; 2020:0165551520954674.
- 16. Identifying #addiction concerns on Twitter during the COVID‐19 pandemic: a text mining analysis. Subst Abus. 2021; 42(1): 39‐46. https://doi.org/10.1080/08897077.2020.1822489
- 17. Weighted h‐index for identifying influential spreaders. Symmetry‐Basel. 2019; 11(10): 1263.
- 18. Quantitative analysis of automatic performance evaluation systems based on the h‐index. Scientometrics. 2020; 123(2): 735‐751.
- 19. An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA. 2005; 102(46): 16569‐16572.
- 20. Which h‐index? An exploration within the Web of Science. Scientometrics. 2020; 123(3): 1225‐1233.
- 21. Using the bootstrapping method to verify whether hospital physicians have different h‐indexes regarding individual research achievement: a bibliometric analysis. Medicine (Baltimore). 2020; 99(33):e21552.
- 22. Measuring method of node importance of urban rail network based on h index. Appl Sci‐Basel. 2019; 9(23): 5189.
- 23. Scientific quality index: a composite size‐independent metric compared with h‐index for 480 medical researchers. Scientometrics. 2019; 119(2): 1009‐1016.
- 24. The introduction and development of the H‐index for imaging utilizers: a novel metric for quantifying utilization of emergency department imaging. Acad Emerg Med. 2019; 26(10): 1125‐1134.
- 25. From corpus to classroom: language use and language teaching. Cambridge: Cambridge University Press; 2007.
- 26. Corpus linguistic onomastics: a plea for a corpus‐based investigation of names. Names. 2020; 68(2): 88‐103.
- 27. Phraseology in multilingual EU legislation: a corpus‐based study of translated multi‐word terms. Perspect‐Stud Transl. https://doi.org/10.1080/0907676X.2020.1800058
- 28. Academic vocabulary and collocations used in language teaching and applied linguistics textbooks: a corpus‐based approach. Terminology. 2020; 26(1): 82‐107.
- 29. Explicitation in children’s literature translated from English to Chinese: a corpus‐based study of personal pronouns. Perspect‐Stud Transl. 2020; 28(5): 717‐736.
- 30. AntConc (Version 3.5.8). Corpus Software. 2019. https://www.laurenceanthony.net/software/antconc/
- 31. PC analysis of key words – and key key words. System. 1997; 25: 233‐245.
- 32. Coronavirus infections – more than just the common cold. JAMA‐J Am Med Assoc. 2020; 323(8): 707‐708.
- 33. Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade‐long structural studies of SARS coronavirus. J Virol. 2020; 94(7): e00127‐20.
- 34. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med. 2020; 382(8): 727‐733.
- 35. Chest CT manifestations of new coronavirus disease 2019 (COVID‐19): a pictorial review. Eur Radiol. 2020; 30(8): 4381‐4389.
- 36. A machine learning model to identify early stage symptoms of SARS‐CoV‐2 infected patients. Expert Syst Appl. 2020; 160:113661.
- 37. Relative contributions of transmission routes for COVID‐19 among healthcare personnel providing patient care. J Occup Environ Hyg. 2020; 17(9): 408‐415.
- 38. Airborne transmission of SARS‐CoV‐2: the world should face the reality. Environ Int. 2020; 139:105730.
- 39. Airborne transmission route of COVID‐19: why 2 meters/6 feet of inter‐personal distance could not be enough. Int J Environ Res Public Health. 2020; 17(8): 2932.
- 40. Demographic characteristics, experiences, and beliefs associated with hand hygiene among adults during the COVID‐19 pandemic – United States, June 24–30, 2020. MMWR‐Morb Mortal Wkly Rep. 2020; 69(41): 1485‐1491.
- 41. Evacuation of quarantine‐qualified nationals from Wuhan for COVID‐19 outbreak – Taiwan experience. J Microbiol Immunol Infect. 2020; 53(3): 392‐393.
- 42. Individual quarantine versus active monitoring of contacts for the mitigation of COVID‐19: a modelling study. Lancet Infect Dis. 2020; 20(9): 1025‐1033.
- 43. High‐frequency words in academic spoken English: corpora and learners. ELT J. 2020; 74(2): 146‐155.
- 44. Mutual attraction between high‐frequency verbs and clause types with finite verbs in early positions: corpus evidence from spoken English, Dutch, and German. Lang Cogn Neurosci. 2019; 34(9): 1140‐1151.
- 45. Deep sentiment classification and topic discovery on novel coronavirus or COVID‐19 online discussions: NLP using LSTM recurrent neural network approach. IEEE J Biomed Health Inform. 2020; 24(10): 2733‐2742.
- 46. A novel statistic‐based corpus machine processing approach to refine a big textual data: an ESP case of COVID‐19 news reports. Appl Sci‐Basel. 2020; 10(16): 5505.
- 47. Research hotspots and trends of bone defects based on Web of Science: a bibliometric analysis. J Orthop Surg Res. 2020; 15(1): 463.
- 48. Documentary analysis of the scientific literature on autism and technology in Web of Science. Brain Sci. 2020; 10(12): 985.
- 49. A comparative analysis of textile schools by journal publications listed in Web of Science (TM). J Text Inst. https://doi.org/10.1080/00405000.2020.1824434