# Data to accompany "Assessing Degenerate Peptide Resolution Methods using a Ground Truth Dataset"

#### Beata J Meluch, 2025-11-02

This data package contains processed LC-MS proteomics results and analysis 
scripts associated with the paper "Assessing Degenerate Peptide Resolution 
Methods using a Ground Truth Dataset". 

## Folder Contents
```
|___Data
|   |__4998_Ground_Truth_Dataset
|   |  |__FASTAs: 11 FASTA files digested in silico to create the reference 
|   |  |          library.
|   |  |__e_data_4998.csv: Peptide expression data by sample.
|   |  |__e_meta_4998.csv: Peptide metadata, including protein mappings.
|   |  |__f_data_4998.csv: Experiment metadata, including sample groupings.
|   |  |__scores_4998.csv: Peptide identification confidence scores from the 
|   |  |                   search tool, MS-GF+.
|   |  |__msnid_4998.RDS:  Peptide identifications in MSnID format, saved as 
|   |                      an R data object.
|   |__5765_Validation_Dataset
|      |__(the same files as above, for the validation dataset)
|
|___Code
    |__peptide_analysis_figures_all_BJM_Oct2025.Rmd: Script used for analysis 
    |  and figure generation.
    |__R_session_info.txt: Environment information, including software package 
    |  versions.
    |__Rmarkdown_complete.Rdata: R environment containing all objects present 
       after running the analysis script. Included to avoid rerunning code 
       chunks that take several hours.
```
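
As a quick way to check the layout above after unpacking, the sketch below lists the column names and row count of every CSV in one dataset folder. It is a minimal Python example using only the standard library and is not part of the analysis, which is written in R. The relative path assumes the script is run from the root of the unpacked package, and the function name `summarize_csvs` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csvs(dataset_dir):
    """Return {file name: (column names, number of data rows)} for each CSV in a folder."""
    summary = {}
    for path in sorted(Path(dataset_dir).glob("*.csv")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        with path.open(newline="", encoding="utf-8", errors="replace") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first line holds the column names
            n_rows = sum(1 for _ in reader)    # remaining lines are data rows
        summary[path.name] = (header, n_rows)
    return summary


if __name__ == "__main__":
    # Assumed location: run from the root of the unpacked data package.
    folder = Path("Data/4998_Ground_Truth_Dataset")
    if folder.is_dir():
        for name, (header, n_rows) in summarize_csvs(folder).items():
            print(f"{name}: {n_rows} rows, first columns: {header[:5]}")
```

The `.RDS` and `.Rdata` files are R-specific and are skipped here; they can be opened in R with `readRDS()` and `load()` respectively.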

## Data Generation Methods

### Sample Preparation

#### "4998_Ground_Truth_Dataset"
Proteins were extracted from lab cultures of the organisms listed in "FASTAs" using methanol/chloroform separation. Extracted proteins were digested using trypsin at a 1:50 enzyme-to-protein ratio. Digested samples were cleaned up via solid-phase extraction on a C18 column. Digested peptides from each organism were dried and resuspended in 150 µL of 5% acetonitrile. Peptide concentration was determined using a bicinchoninic acid assay according to the manufacturer's instructions. Peptides from each organism were combined for a total of 32 µg of peptides in each mixture. Water was added to bring each mixture to 80 µL. The five mixtures were each separated into six fractions and run in triplicate for a total of 90 LC-MS/MS runs.
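
The run count and nominal mixture concentration follow directly from the numbers in this paragraph. The short Python sketch below works through that arithmetic. It assumes every fraction of every mixture was run exactly three times and that the 32 µg of peptides is fully dissolved in the final 80 µL; the variable and function names are illustrative only.

```python
# Values taken from the sample preparation description above.
N_MIXTURES = 5          # peptide mixtures
N_FRACTIONS = 6         # fractions per mixture
N_REPLICATES = 3        # each fraction run in triplicate
PEPTIDE_MASS_UG = 32.0  # total peptide mass per mixture, in micrograms
FINAL_VOLUME_UL = 80.0  # final mixture volume, in microliters


def total_runs(mixtures, fractions, replicates):
    """Number of LC-MS/MS runs if every fraction of every mixture is run `replicates` times."""
    return mixtures * fractions * replicates


def nominal_concentration(mass_ug, volume_ul):
    """Nominal peptide concentration in ug/uL before fractionation."""
    return mass_ug / volume_ul


print(total_runs(N_MIXTURES, N_FRACTIONS, N_REPLICATES))        # 90
print(nominal_concentration(PEPTIDE_MASS_UG, FINAL_VOLUME_UL))  # 0.4
```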

#### "5765_Validation_Dataset"
Samples were prepared as described above for the 4998 dataset, except that they did not undergo solid-phase extraction cleanup or fractionation. Samples were run in triplicate for a total of 12 LC-MS/MS runs.

### LC-MS/MS Analysis

#### "4998_Ground_Truth_Dataset"
Fractionated samples were separated by LC using a Thermo Dionex Ultimate 3000 with autosampler (Thermo Fisher Scientific, Waltham, MA) coupled to a hand-built 75 µm x 30 cm Waters 1.75 µm particle BEH C18 column (Waters Corp, Milford, MA) with a 120-minute formic acid separation, and were then run on the Thermo Fisher Q Exactive HF-X mass spectrometer.

#### "5765_Validation_Dataset"
Samples were separated by LC using a Waters nanoACQUITY with autosampler (Waters Corp, Milford, MA). Samples first underwent solid-phase extraction on a 4 cm x 100 µm, 5 µm pore Jupiter C18 trapping column (Phenomenex, Torrance, CA) connected to a hand-built 75 µm x 25 cm Waters 1.7 µm particle BEH C18 column (Waters Corp, Milford, MA) with a 120-minute formic acid separation, and were then run on the Thermo Fisher Orbitrap Eclipse Tribrid mass spectrometer.

### Spectrum Data Processing (both datasets)
Peptide identification was performed using MS-GF+ v2023.01.12. MS-GF+ results were processed with TMTPipeline into three CSV files formatted for analysis with the pmartR package: one file containing abundance data for each peptide (“e_data”), a second file with sample grouping information (“f_data”), and a third file containing all identified protein matches for all peptides (“e_meta”).
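
Because “e_meta” lists every protein match for every peptide, degenerate peptides (those matching more than one protein) can be found by grouping its rows by peptide. The Python sketch below illustrates the idea on a tiny made-up table. It assumes a long format with one row per peptide-protein pair, and the column names `Peptide` and `Protein` are hypothetical; check the header of the real file before adapting it. The analysis in this package is done in R, so this is only an illustration of the table's structure.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; the real e_meta header may differ.
PEPTIDE_COL = "Peptide"
PROTEIN_COL = "Protein"


def proteins_per_peptide(rows, peptide_col=PEPTIDE_COL, protein_col=PROTEIN_COL):
    """Map each peptide to the set of proteins it matches (one input row per match)."""
    mapping = defaultdict(set)
    for row in rows:
        mapping[row[peptide_col]].add(row[protein_col])
    return dict(mapping)


def degenerate_peptides(mapping):
    """Return the peptides that match more than one protein, sorted by sequence."""
    return sorted(pep for pep, proteins in mapping.items() if len(proteins) > 1)


# Toy e_meta-style table, invented for illustration only.
TOY_E_META = """Peptide,Protein
AAAK,protA
AAAK,protB
CCCR,protA
"""

mapping = proteins_per_peptide(csv.DictReader(io.StringIO(TOY_E_META)))
print(degenerate_peptides(mapping))  # ['AAAK']
```

To run it on a real file, pass `csv.DictReader(open(path, newline=""))` as `rows` and set the two column-name arguments to match that file's header.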


## Funding 
This research was supported by the National Microbiome Data Collaborative, an initiative of the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) under contract number DE-AC05-76RL01830 (PNNL). Data were generated on project award 60205 (https://doi.org/10.46936/staf.proj.2021.60205/60006945) from the Environmental Molecular Sciences Laboratory, a DOE Office of Science User Facility sponsored by the Biological and Environmental Research program under Contract No. DE-AC05-76RL01830.

## License
This data product is licensed under a Creative Commons Zero (“CC0”) Public Domain Dedication Waiver (https://creativecommons.org/publicdomain/zero/1.0/) in accordance with PNNL DataHub policy (https://data.pnnl.gov/policy).

## Contact
beata.meluch@pnnl.gov