This file was last generated on 2026-05-19 by Bill Nelson 

GENERAL INFORMATION 

1. Title of Dataset: Soil DNA and RNA Virus Sequence Data Recovered Using Different Preparation Methods 

2. Dataset Abstract: 

To enhance detection of communities of DNA and RNA viruses, we applied different preparation methods to soils collected across a moisture gradient from a grassland field experiment. Analyses included metagenomics and metatranscriptomics of size-fractionated extracellular viruses and total soil, total soil metatranscriptomics with polyadenylation enrichment, and metagenomics of bacteria/archaea as well as eukaryote-enriched samples. DNA virome isolation outperformed total soil metagenomes. In contrast, RNA virome isolation and total soil metatranscriptomes performed similarly for viral recovery, though RNA viromes yielded higher-quality genomes. Preparation methods differed in the viral communities and predicted host ranges they identified. Relationships of vOTU diversity and expression with moisture varied by nucleic acid type (DNA or RNA vOTUs) and by method.

3. Principal Researcher:  

Name: Kirsten Hofmockel 

Institution: Pacific Northwest National Laboratory  

Email: kirsten.hofmockel@pnnl.gov 

ORCID: 0000-0003-1586-2167 

4. Additional Author Contact Information 

Name: Josue Rodriguez-Ramos 

Institution: Pacific Northwest National Laboratory  

Email: josue.rodriguez@pnnl.gov 

ORCID: 0000-0002-2049-2765 

5. Information about funding sources supporting the data:  

This program is supported by the U. S. Department of Energy, Office of Science, Office of Biological and Environmental Research, through the Genomic Science Program, under FWP 70880. This research was also performed under the Facilities Integrating Collaborations for User Science (FICUS) program (proposal: 10.46936/fics.proj.2022.60449/60008585) and used resources at the DOE Joint Genome Institute (https://ror.org/04xm1d337) and the Environmental Molecular Sciences Laboratory (https://ror.org/04rc0xn13), which are DOE Office of Science User Facilities operated under Contract Nos. DE-AC02-05CH11231 (JGI) and DE-AC05-76RL01830 (EMSL). 

6. Geographic location of data collection:  

46° 15' 04" N, 119° 43' 43" W 

City: Prosser, State: WA, Country: USA

7. Date of data collection (single date, range, approximate date):  

2022-10-18, 2023-03-07 

  

DATA & FILE OVERVIEW 

1. File List: 

_ReadMe_2026_05_13.txt - This file 

1_master_header_metadata.csv - describes content of other files: variable names (column headers) and descriptive metadata, row counts, missing data indicators, and any specialized formatting used throughout each dataset file. 

SupplementaryTables.zip - supplementary tables from manuscript

jgi_sample_metadata.csv - sequencing sample metadata provided by JGI; 9 columns, 86 rows 

read_mapping_statistics.csv - Statistics of read mapping to assemblies; 5 columns, 86 rows 

sample_metadata.csv - sample processing metadata from lab for extractions; 11 columns, 86 rows 

site_metadata.csv - environmental parameters for the samples collected; 22 columns, 13 rows 

VirStats_RdRP_taxonomy.csv - RNA dependent RNA polymerase taxonomic assignments; 16 columns, 8335 rows  

VirStats_vContact2_output.csv - vContact2 viral taxonomy output; 24 columns, 13316 rows  

VirStats_ViralContigs.csv - Viral contig statistics from CheckV and mvip; 21 columns, 28071 rows  

VirStats_VirHost_DNA.csv - Virus host link results for DNA viruses; 14 columns, 1317 rows  

VirStats_VirHost_RNA.csv - Virus host link results for RNA viruses; 6 columns, 8302 rows  
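The column and row counts listed above can be checked after download. The following is a minimal sketch, assuming each CSV has a single header row and that the listed row counts exclude that header (the README does not say either way); the demo column names are hypothetical.

```python
import csv
import io


def csv_shape(stream):
    """Return (n_columns, n_data_rows) for a CSV stream with one header row."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return (0, 0)
    # Blank lines are parsed as empty lists and are not counted as data rows.
    n_rows = sum(1 for row in reader if row)
    return (len(header), n_rows)


# In-memory demo with hypothetical column names.
demo = io.StringIO("sample_id,plot,irrigation\nS1,40,100\nS2,12,25\n")
print(csv_shape(demo))  # (3, 2)
```

For a real file, pass a handle opened with open(path, newline="") and compare the result with the counts in the file list.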

2. Relationship between files, if important: NA

3. Additional related data collected that was not included in the current data package: NA

4. Are there multiple versions of the dataset? No

 

METHODOLOGICAL INFORMATION  

1. Description of methods used for collection/generation of data:  

Combination of omics-based approaches that are described here: PNNL_soilSFA/interkingdom_virus/tree/main) and laboratory-based approaches detailed below. 

2. Methods for processing the data:  

Field Site Description

Samples were collected from the Tall Wheatgrass Irrigation Field Trial in Prosser, WA, USA (46° 15′ 04″ N and 119° 43′ 43″ W), operated by Washington State University, and described in our previous publications. The site is characterized by marginal Aridisol soils with low organic matter content (<2%), pH of 8, and a sandy loam texture (55.5% sand, 34.1% silt, 10.4% clay). Specifically, it is described as coarse-silty, mixed, superactive, mesic Xeric Haplocambids, with high porosity, permeability, and soil bulk density (avg = 1.56g/cm3). Tall wheatgrass (Thinopyrum ponticum), which is drought-tolerant and adapted for growth on marginal soil, was established in May 2018, prior to which the site was uncultivated desert shrub-steppe. Plants are uniformly distributed within plots. Irrigation treatments have been ongoing since spring 2019. Irrigation is provided through drip lines from April to October, with water supplied at four levels (100%, 75%, 50%, and 25% field water capacity) to create plots with differing water stress based on modeled crop evapotranspiration of tall wheatgrass. Each experimental plot is 2.1 m × 10.7 m with a 1.5 m alley between adjacent plots. (Citations: 10.1128/msystems.00099-24; https://doi.org/10.3389/frmbi.2023.1078024; 10.1128/mSystems.00055-19)   

Soil Collection  

Soil cores were collected on 18 Oct 2022 and 7 March 2023 from plots within the highest (100%) and lowest (25%) irrigation treatments, including 3 field replicates per treatment. All sampled plots were planted with the Alkar cultivar, except one plot with the Jose cultivar (Plot 40, 100% irrigation treatment) which was re-sampled in the spring for consistency, resulting in unequal sample numbers between collection efforts (n = 6 for October 2022, n = 7 for March 2023). Within each plot, one core (5cm diameter) was collected from a random location down to 15 cm depth. The 0-5 cm portion was discarded to remove the surface litter layer. The 5-15 cm portion of each soil core was aseptically broken up and a subsample for Tot RNA was snap frozen in liquid nitrogen (and stored at -80°C prior to RNA extraction). The remaining soil was transported from the field site to the Pacific Northwest National Laboratory (PNNL) on ice for processing. 

In the laboratory, 2 mm sieves were used to homogenize soil and remove large roots and rocks prior to subsampling. All subsampling was completed the same day as sample collection. Subsamples for bacterial and archaeal (BAr), fungal (Euk), and viral (Vir) fractionation were stored at 4°C until further processing. Soil for DNA extraction was stored at -80°C until processing.

The soil water content was measured by the gravimetric method for each sample. Briefly, 10 g of soil was dried at 60°C until a stable weight was achieved. Gravimetric water content (GWC) was calculated as the fresh soil weight minus the dry soil weight, relative to fresh soil weight. pH was determined in 1:2 soil water slurries according to.  
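The GWC definition above can be written as a short function. This is a sketch of the stated formula only (water mass relative to fresh soil mass); the function and argument names are illustrative and not taken from the project code.

```python
def gravimetric_water_content(fresh_g: float, dry_g: float) -> float:
    """GWC = (fresh weight - dry weight) / fresh weight, as defined above."""
    if fresh_g <= 0:
        raise ValueError("fresh soil weight must be positive")
    if dry_g < 0 or dry_g > fresh_g:
        raise ValueError("dry weight must be between 0 and the fresh weight")
    return (fresh_g - dry_g) / fresh_g


# Example: a 10 g fresh subsample that weighs 9.2 g after drying.
gwc = gravimetric_water_content(10.0, 9.2)
print(f"{gwc:.3f}")
```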

DNA and RNA viral contig identification 

After DNA sequencing at the Joint Genome Institute (JGI) from all different data types, the resulting reads were quality-filtered with the default JGI workflow, which uses bbduk with flags ktrim=r, ordered, minlen=51, minlenfraction=0.33, mink=11, tbo, tpe, rcomp=f, k=23, hdist=1, hdist2=1, ftm=5, pratio=G,C, plen=20, phist, qhist, bhist, and gchist. These reads were downloaded from the JGI data portal and assembled in-house using MEGAHIT with default settings. To identify viral contigs from our assemblies, we used the modular viromics pipeline (MVP) v1.1.1, which uses geNOMAD v1.7.4, CheckV v1.0.3, trimAl v1.5.0, MAFFT v7.526, and FastTree v2.1.1. MVP was run with a minimum scaffold cutoff of 10 kb. Viruses identified in each sample were then clustered at 95% ANI across 85% of the shortest contig per MIUViG standards. DNA vOTUs were required to have a viral score of ≥0.7 and no more host genes than viral genes, yielding a per-method clustered database of 19,759 DNA vOTUs that met our quality cutoffs. These were then clustered across all sample types, generating 17,590 DNA vOTUs.
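The DNA quality cutoffs described above (length of at least 10 kb, viral score ≥0.7, no more host genes than viral genes) can be expressed as a simple filter. The sketch below is illustrative only: the Contig record and its field names are hypothetical and do not reflect the MVP output format.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    # Hypothetical record; real MVP/geNOMAD/CheckV tables use other column names.
    name: str
    length_bp: int
    virus_score: float
    host_genes: int
    viral_genes: int


def passes_dna_cutoffs(c: Contig, min_length_bp: int = 10_000,
                       min_score: float = 0.7) -> bool:
    """Apply the three DNA vOTU criteria stated in the text."""
    return (
        c.length_bp >= min_length_bp
        and c.virus_score >= min_score
        and c.host_genes <= c.viral_genes
    )


contigs = [
    Contig("c1", 25_000, 0.95, 1, 12),  # passes all cutoffs
    Contig("c2", 8_000, 0.99, 0, 6),    # too short
    Contig("c3", 15_000, 0.65, 0, 9),   # viral score too low
    Contig("c4", 30_000, 0.90, 7, 3),   # more host genes than viral genes
]
kept = [c.name for c in contigs if passes_dna_cutoffs(c)]
print(kept)  # ['c1']
```

For RNA viruses, the text describes the same score and gene criteria without a length cutoff, which in this sketch would correspond to min_length_bp=0 plus a separate RdRp requirement.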

After RNA sequencing at JGI from all different data types, reads were quality-filtered and assembled with the standard JGI workflow, which used MEGAHIT with default settings and the flag --k-list 23,43,63,83,103,123. Assemblies were downloaded from the JGI data portal, and MVP was run. Given that RNA viruses are often segmented and there is no consensus on a minimum genome size, MVP was run with the same settings as above, except that we did not require a minimum size cutoff. RNA viruses were required to have a virus score ≥0.7, no more host genes than viral genes, and an RNA-dependent RNA polymerase (RdRp) gene. RdRp identification was taken from MVP, which uses HMMER with a default score cutoff of 50 and an e-value cutoff of 0.01 to search the geNOMAD RdRp database.

RNA viruses that had an unknown genome type (RNA) or ssRNA genome type and had an RdRp were placed in phylogenetic trees to confirm their taxonomy against the Riboviria database (see section on RNA virus taxonomic assignment). With the confirmed taxonomic assignment, their genome types were corroborated using the latest International Committee on Taxonomy of Viruses (ICTV) master list. Only confirmed ssRNA viruses that had positive or negative genome types, with no instance of an ambisense genome by taxonomy, were retained for downstream analyses. This yielded a database of 8,302 vOTUs clustered per method that was used to address the differences in efficiency of RNA methods. Given that current methods cannot assign transcript abundance information to double-stranded RNA (dsRNA) viruses, and to keep the ecological and methodological analyses of this manuscript comparable, we excluded RNA viruses with dsRNA genomes (515 dsRNA vOTUs). Additionally, viruses that could not confidently be assigned to a genome type were removed (323 vOTUs). Finally, viral contigs whose nucleic acid type did not match the data type from which they came were removed (i.e., RNA viruses from DNA methods, or DNA viruses from RNA methods; 8 vOTUs).

DNA virus taxonomic assignment 

Viral taxonomy for DNA viruses was assessed using vContact2 v0.11.3. Viruses were compared to the reference database of RefSeq v211. Our vOTUs were run through Prodigal v2.6.3 using default settings to identify protein coding genes, resulting in a database of 761,653 genes. Then, vContact2 was run with flags --rel-mode Diamond --db "ProkaryoticViralRefSeq211-Merged" --pcs-mode MCL --vcs-mode ClusterONE --pc-evalue 0.0001 --reported-alignments 25 --max-overlap 0.8 --penalty 2.0 --haircut 0.1 --pc-inflation 2.0 --vc-inflation 2.0 --min-density 0.3 --min-size 2 --vc-overlap 0.9 --vc-penalty 2 --vc-haircut 0.55 --merge-method single --similarity match --seed-method nodes --sig 1.0 --max-sig 300 --mod-inflation 5 --mod-sig 1.0 --mod-shared-min 3 --link-sig 1.0 --link-prop 0.5 --verbose -vv -o ./ -t 32 --c1-bin.  

RNA virus taxonomic assignment 

RNA virus contigs that had an RdRp sequence were assigned taxonomy via phylogenetic trees. To build a reference of RdRps for alignment while also reducing compute times, preliminary taxonomic assignment by MVP was used to download a set of reference sequences from the Riboviria database that matched the MVP-assigned taxonomies. All viral contigs that corresponded to each taxonomic assignment were downloaded. For specifics of which viral contig was placed in which tree, see Table S2. After all corresponding sequences for each phylogenetic group were downloaded, we ran a blastp search using the RdRp from our viral contigs as the query and the entire corresponding Riboviria group as the reference, and retained the best hit. RdRp sequences from each group were then aligned to the references using MUSCLE5, resulting in a total of 21 unique alignments. The resulting alignments were then automatically trimmed with trimAl (flag -automated1) and subsequently used to generate a phylogenetic tree with FastTree. Trees were then re-rooted to the specified outgroup using ETE3. ssRNA viral contigs from our dataset were assigned the taxonomic string of the nearest neighbor from the Riboviria reference.

DNA and RNA viral read mapping for abundance and expression 

The vOTUs from each preparation were clustered across all methods at 95% ANI across 85% of the shortest contig, which generated the final tables of 17,590 DNA vOTUs and 6,005 RNA vOTUs. To determine whether a virus was recovered (i.e., fully assembled and classified as viral) from each method, the viral clustering output was parsed using a parser available on GitHub (https://github.com/PNNL-SoilSFA/interkingdom_virus). If a viral contig was clustered across different methods, it was assigned an overlap count. To identify whether viruses were detected in each preparation method at the read level (i.e., the sequencing level), individual reads from each preparation were mapped to the consolidated database of either DNA vOTUs (17,590 vOTUs) or RNA vOTUs (6,005 vOTUs).
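The overlap count described above can be illustrated with a small sketch. It assumes the clustering output has been reduced to (representative vOTU, member contig, method) rows and that the overlap count is the number of distinct methods contributing members to a cluster; both the row layout and that definition are assumptions, and the project's actual parser on GitHub may differ.

```python
from collections import defaultdict


def methods_per_votu(cluster_rows):
    """Map each representative vOTU to the sorted list of methods in its cluster.

    cluster_rows: iterable of (representative, member_contig, method) tuples.
    """
    methods = defaultdict(set)
    for representative, _member, method in cluster_rows:
        methods[representative].add(method)
    return {rep: sorted(found) for rep, found in methods.items()}


def overlap_counts(cluster_rows):
    """Number of distinct preparation methods that recovered each vOTU."""
    return {rep: len(found) for rep, found in methods_per_votu(cluster_rows).items()}


# Hypothetical clustering rows using the method labels from this README.
rows = [
    ("vOTU_1", "Vir_contig_10", "Vir"),
    ("vOTU_1", "Tot_contig_77", "Tot"),
    ("vOTU_1", "Vir_contig_31", "Vir"),
    ("vOTU_2", "BAr_contig_5", "BAr"),
]
print(overlap_counts(rows))  # {'vOTU_1': 2, 'vOTU_2': 1}
```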

For DNA vOTUs, individual preparation reads were mapped using bbmap and filtered to a 98% minimum percent identity using reformat.sh. The resulting SAM files were then processed through CoverM to apply a 75% minimum coverage and 1x depth cutoff for each hit. A virus was considered detected if it had assigned counts at the specified cutoffs. To compare DNA virus diversity trends, reads from each preparation were mapped to a database of vOTUs from each respective preparation. Additionally, Tot DNA reads were mapped to the consolidated database of all 17,590 vOTUs for the “Full” viral diversity that is shown in Figure 8. All mapping was done as specified above using bbmap. To determine DNA virus transcript abundance, total RNA reads were mapped to the consolidated database of 17,590 vOTUs, as well as to individual vOTU databases from each method, using bbmap and filtered to a 98% minimum percent identity using reformat.sh. Then, featureCounts was run to estimate read counts per gene region. The resulting transcript abundance values were imported into R and normalized using gene length corrected trimmed mean of M-values normalization (GeTMM).
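The detection rule above (at least 75% of the vOTU covered and at least 1x depth) can be sketched as a per-vOTU check. In the actual workflow these cutoffs were applied by CoverM; the function below only restates the rule, and treating both cutoffs as inclusive and the depth as a mean depth are assumptions.

```python
def is_detected(covered_fraction: float, mean_depth: float,
                min_covered_fraction: float = 0.75,
                min_depth: float = 1.0) -> bool:
    """True if a vOTU meets both the coverage and depth cutoffs in a sample."""
    return covered_fraction >= min_covered_fraction and mean_depth >= min_depth


def detected_votus(stats):
    """Return sorted names of detected vOTUs.

    stats: dict mapping vOTU name -> (covered_fraction, mean_depth).
    """
    return sorted(
        name for name, (fraction, depth) in stats.items()
        if is_detected(fraction, depth)
    )


# Hypothetical per-sample mapping statistics.
sample_stats = {
    "vOTU_1": (0.92, 4.3),   # detected
    "vOTU_2": (0.60, 12.0),  # deep but covers too little of the contig
    "vOTU_3": (0.80, 0.4),   # broad but too shallow
}
print(detected_votus(sample_stats))  # ['vOTU_1']
```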

For RNA vOTUs, a TruSeq kit was used, which enabled strand-conserved sequencing. Both the strandedness of genes within an RNA virus and their taxonomic assignment are critical for assigning reads as being representative of either abundance or transcript abundance. Further, because taxonomic assignment in RNA viruses depends on the highly conserved RdRp gene, any viruses that did not contain an RdRp with an e-value cutoff of 0.1 and a match score of 50 from the MVP RdRp annotation pipeline were removed from further analyses, and the RdRp of each remaining RNA virus was used as the “genomic representative” for each genome’s ecological patterns. While we recognize that dsRNA viruses are important, only ssRNA viruses were used to assign abundance and transcript abundance patterns, given that their replication process involves the formation of a complementary strand.

Leveraging the gene and strandedness assignments from geNOMAD as part of the MVP pipeline, we used the MVP “geaParser”, which reads each BAM file and assigns read counts to either the coding or non-coding strand, taking into account whether reads mapped to the template or non-template strand of our RdRp sequences. A post-processing script then considered the strandedness of the viral gene to determine whether mapped reads represented abundance or transcript abundance. ssRNA viruses were considered abundant if recruitment happened on the template strand, and the overall “abundance” was inferred as the total count of reads identified per sample. ssRNA viruses were considered active if recruitment happened on the non-template strand. Given that the significance of higher or lower counts in RNA virus transcript abundance is still debated, RNA viruses that were detected as active were then assigned their respective abundance value for each sample as a proxy for their transcript abundance, as has been done previously. RNA viruses that were identified as potentially active but not abundant were removed from subsequent analyses (113 vOTUs), as they likely represented viruses that did not have reads recruiting to the expected strands due to the mapping cutoffs used.
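The strand-based rules above can be summarized in a short sketch. It assumes that per-sample read counts for the template and non-template strands of the RdRp are already available, and that "abundance" is the template-strand count; the function name and the returned fields are hypothetical and are not the MVP geaParser interface.

```python
def summarize_votu(template_reads: int, non_template_reads: int):
    """Classify one ssRNA vOTU in one sample from strand-specific read counts.

    Returns None for vOTUs that look active but have no template-strand
    reads, mirroring the removal step described in the text.
    """
    abundance = template_reads          # assumption: template-strand reads
    active = non_template_reads > 0     # non-template recruitment = active
    if active and abundance == 0:
        return None                     # active but not abundant: removed
    return {
        "abundance": abundance,
        "active": active,
        # Active vOTUs carry their abundance value as the expression proxy.
        "transcript_proxy": abundance if active else 0,
    }


print(summarize_votu(12, 3))
print(summarize_votu(7, 0))
print(summarize_votu(0, 5))
```

Using the abundance value as the proxy, rather than the non-template count itself, follows the text's point that the meaning of higher or lower transcript counts for RNA viruses is still debated.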

Moisture was measured as described above and used as a continuous variable to relate it statistically to the viral diversity and expression metrics detailed above. Linear models were fit with the lm function from the R stats package.
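The original analysis was run in R. As a rough Python analogue of a single-predictor linear model, the sketch below fits ordinary least squares of a diversity metric on moisture in closed form. The data values are made up for illustration and this is not the authors' script.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return slope, intercept, r_squared


# Made-up example: gravimetric water content vs. a diversity metric.
moisture = [0.05, 0.08, 0.11, 0.15, 0.20]
diversity = [3.1, 3.4, 3.3, 3.9, 4.2]
slope, intercept, r2 = fit_line(moisture, diversity)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.2f}")
```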

Virus host prediction 

Host assignment for DNA vOTUs was performed using iPHoP version 1.4, run with default settings and the Jun25_rw database. iPHoP results were then processed using a custom parser, available on GitHub, that reports only the hit(s) with the highest confidence score for each virus and enforces a minimum confidence score of 90. Host assignment for RNA vOTUs was performed using RNAVirHost with the taxonomic assignments generated for this manuscript. Hits were only considered if they were determined by RNAVirHost to be “high confidence” or “assigned”.
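The filtering rule for DNA virus host hits (keep only the highest-confidence hit or hits per virus, with a minimum score of 90) can be sketched as follows. The row keys "virus", "host", and "confidence" are hypothetical and do not match the iPHoP output headers; the project's own parser on GitHub is the authoritative version.

```python
def top_host_hits(rows, min_confidence: float = 90.0):
    """Keep the best-scoring host(s) per virus, ties included.

    rows: iterable of dicts with hypothetical keys
          "virus", "host", and "confidence".
    Returns a dict mapping virus -> list of host names.
    """
    best = {}  # virus -> (best score so far, list of hosts at that score)
    for row in rows:
        score = float(row["confidence"])
        if score < min_confidence:
            continue
        virus = row["virus"]
        current = best.get(virus)
        if current is None or score > current[0]:
            best[virus] = (score, [row["host"]])
        elif score == current[0]:
            current[1].append(row["host"])
    return {virus: hosts for virus, (_score, hosts) in best.items()}


# Hypothetical predictions for three viruses.
predictions = [
    {"virus": "vOTU_1", "host": "Streptomyces", "confidence": "95.2"},
    {"virus": "vOTU_1", "host": "Nocardioides", "confidence": "91.0"},
    {"virus": "vOTU_2", "host": "Bacillus", "confidence": "93.5"},
    {"virus": "vOTU_2", "host": "Paenibacillus", "confidence": "93.5"},
    {"virus": "vOTU_3", "host": "Pseudomonas", "confidence": "88.0"},
]
print(top_host_hits(predictions))
```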

3. Instrument- or software-specific information needed to interpret the data:  

https://github.com/PNNL-SoilSFA/interkingdom_virus 

4. Standards and calibration information, if appropriate: listed above. 

5. Environmental/experimental conditions: listed above. 

6. Describe any quality-assurance procedures performed on the data: listed above; see https://github.com/PNNL-SoilSFA/interkingdom_virus for data curation and analysis processes.

7. People involved with sample collection, processing, analysis and/or submission: Josué A. Rodríguez-Ramos, Amy E. Zimmerman, Ruonan Wu, Sheryl Bell, Trinidad Alfaro, Nicholas Reichart, Kirsten S. Hofmockel, William C. Nelson 


SHARING/ACCESS INFORMATION 

1. Licenses/restrictions placed on the data: This work is marked with CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/. The authors do request that you appropriately cite the dataset when referencing or re-using the dataset. 

2. Links to publications that cite or use the data: NA 

3. Links to other publicly accessible locations of the data: NA   

4. Links/relationships to ancillary data sets: NA  

5. Was data derived from another source? no   

6. Recommended citation for this dataset:  

Josué A. Rodríguez-Ramos, Amy E. Zimmerman, Ruonan Wu, Sheryl Bell, Trinidad Alfaro, Kirsten Hofmockel, William C. Nelson. Soil DNA and RNA Virus Sequence Data Recovered Using Different Preparation Methods. https://doi.org/10.25584/2583337 