Title: | Proteomics Data Analysis Functions |
---|---|
Description: | Data analysis of proteomics experiments by mass spectrometry is supported by this collection of functions mostly dedicated to the analysis of (bottom-up) quantitative (XIC) data. Fasta-formatted proteomes (eg from UniProt Consortium <doi:10.1093/nar/gky1049>) can be read with automatic parsing and multiple annotation types (like species origin, abbreviated gene names, etc) extracted. Initial results from multiple software for protein (and peptide) quantitation can be imported (to a common format): MaxQuant (Tyanova et al 2016 <doi:10.1038/nprot.2016.136>), Dia-NN (Demichev et al 2020 <doi:10.1038/s41592-019-0638-x>), Fragpipe (da Veiga et al 2020 <doi:10.1038/s41592-020-0912-y>), ionbot (Degroeve et al 2021 <doi:10.1101/2021.07.02.450686>), MassChroq (Valot et al 2011 <doi:10.1002/pmic.201100120>), OpenMS (Strauss et al 2021 <doi:10.1038/nmeth.3959>), ProteomeDiscoverer (Orsburn 2021 <doi:10.3390/proteomes9010015>), Proline (Bouyssie et al 2020 <doi:10.1093/bioinformatics/btaa118>), AlphaPept (preprint Strauss et al <doi:10.1101/2021.07.23.453379>) and Wombat-P (Bouyssie et al 2023 <doi:10.1021/acs.jproteome.3c00636>). Meta-data provided by initial analysis software and/or in sdrf format can be integrated to the analysis. Quantitative proteomics measurements frequently contain multiple NA values, due to physical absence of given peptides in some samples, limitations in sensitivity or other reasons. Help is provided to inspect the data graphically to investigate the nature of NA-values via their respective replicate measurements and to help/confirm the choice of NA-replacement algorithms. Meta-data in sdrf-format (Perez-Riverol et al 2020 <doi:10.1021/acs.jproteome.0c00376>) or similar tabular formats can be imported and included. Missing values can be inspected and imputed based on the concept of NA-neighbours or other methods.
Dedicated filtering and statistical testing using the framework of package 'limma' <doi:10.18129/B9.bioc.limma> can be run, enhanced by multiple rounds of NA-replacements to provide robustness towards rare stochastic events. Multi-species samples, as frequently used in benchmark-tests (eg Navarro et al 2016 <doi:10.1038/nbt.3685>, Ramus et al 2016 <doi:10.1016/j.jprot.2015.11.011>), can be run with special options considering such sub-groups during normalization and testing. Subsequently, ROC curves (Hand and Till 2001 <doi:10.1023/A:1010920819831>) can be constructed to compare multiple analysis approaches. As a detailed example, the data-set from Ramus et al 2016 <doi:10.1016/j.jprot.2015.11.011> quantified by MaxQuant, ProteomeDiscoverer, and Proline is provided with a detailed analysis of heterologous spike-in proteins. |
Authors: | Wolfgang Raffelsberger [aut, cre] |
Maintainer: | Wolfgang Raffelsberger <[email protected]> |
License: | GPL-3 |
Version: | 1.13.0 |
Built: | 2025-02-17 03:15:46 UTC |
Source: | https://github.com/cran/wrProteo |
This function returns the molecular masses of the main elements found in biology/proteomics as average and mono-isotopic mass. The result includes H, C, N, O, P, S, Se and the electron. The values are based on http://www.ionsource.com/Card/Mass/mass.htm in reference to http://physics.nist.gov/Comp (as of 2019).
.atomicMasses()
This function returns a numeric matrix with mass values
.atomicMasses()
This function allows checking presence of knitr and rmarkdown
.checkKnitrProt(tryF = FALSE)
tryF |
(logical) |
This function returns a logical value
.checkKnitrProt()
This (low-level) function performs an additional/final check & adjustment of sample-names after readSampleMetaData()
.checkSetupGroups( abund, setupSd, gr = NULL, sampleNames = NULL, quantMeth = NULL, silent = FALSE, callFrom = NULL, debug = FALSE )
abund |
(matrix or data.frame) abundance data, only the colnames will be used |
setupSd |
(list) describing sample-setup, typically produced with package wrMisc |
gr |
(factor) optional custom information about replicate-layout, has priority over setupSd |
sampleNames |
(character) custom sample-names, has priority over abund and setupSd |
quantMeth |
(character) 2-letter abbreviation of name of quantitation-software (eg 'MQ') |
silent |
(logical) suppress messages |
callFrom |
(character) allow easier tracking of messages produced |
debug |
(logical) display additional messages for debugging |
This function returns an enlarged/updated list 'setupSd' (set setupSd$sampleNames, setupSd$groups)
used in readProtDiscovererFile, readMaxQuantFile, readProlineFile, readFragpipeFile
set.seed(2021)
This (low-level) function allows accessing matrix with UniProt abbreviations for species frequently used in research. This information may be used to harmonize species descriptions or extract species information out of protein-names.
.commonSpecies()
This function returns a 2-column matrix with species names
used eg in readProtDiscovererFile, readMaxQuantFile, readProlineFile, readFragpipeFile
.commonSpecies()
This (low-level) function creates the column annot[,'SpecType'] which may help distinguishing different lines/proteins.
This information may, for example, be used to normalize only to all proteins of a common background matrix (species).
In order to compare specPref, a species-column will be added to the annotation (annot), if not already present.
If $mainSpecies or $conta: match to annot[,"Species"], annot[,"EntryName"], annot[,"GeneName"], if length==1 grep in annot[,"Species"]
.extrSpecPref( specPref, annot, useColumn = c("Species", "EntryName", "GeneName", "Accession"), suplInp = NULL, silent = FALSE, debug = FALSE, callFrom = NULL )
specPref |
(list) may contain $mainSpecies, $conta ... |
annot |
(matrix) main protein annotation |
useColumn |
(character) columns from annot to use/mine |
suplInp |
(matrix) additional custom annotation |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging (starting with 'mainSpecies','conta' and others - later may overwrite prev settings) |
callFrom |
(character) allow easier tracking of messages produced |
In contrast to readSampleMetaData, this function also considers the main annotation as extracted with the main quantification data.
For example, this function can complement protein annotation data if columns 'Accession','EntryName' or 'SpecType' are missing
This function returns a matrix with additional column 'SpecType'
used in readProtDiscovererFile, readMaxQuantFile, readProlineFile, readFragpipeFile
annot1 <- cbind( Leading.razor.protein=c("sp|P00925|ENO2_YEAST", "sp|Q3E792|RS25A_YEAST",
  "sp|P09938|RIR2_YEAST", "sp|P09938|RIR2_YEAST", "sp|Q99186|AP2M_YEAST", "sp|P00915|CAH1_HUMAN"),
  Species= rep(c("Saccharomyces cerevisiae","Homo sapiens"), c(5,1)))
specPref1 <- list(conta="CON_|LYSC_CHICK", mainSpecies="OS=Saccharomyces cerevisiae", spike="P00915")   # MQ type
.extrSpecPref(specPref1, annot1, useColumn=c("Species","Leading.razor.protein"))
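The idea of tagging each protein with a 'SpecType' can be sketched in base R. This is only an illustration under simplifying assumptions: the object names (annotS, specPrefS) are hypothetical, a single annotation column is searched, and later entries of the list overwrite earlier ones. It is not the internal code of .extrSpecPref.

```r
# Hypothetical mini-annotation (not taken from the package)
annotS <- cbind(
  EntryName = c("ENO2_YEAST", "RS25A_YEAST", "CAH1_HUMAN", "LYSC_CHICK"),
  Species   = c("Saccharomyces cerevisiae", "Saccharomyces cerevisiae",
                "Homo sapiens", "Gallus gallus"))

# Search patterns per type; order matters since later types overwrite earlier ones
specPrefS <- list(mainSpecies = "_YEAST", spike = "_HUMAN", conta = "CON_|LYSC_CHICK")

specType <- rep(NA_character_, nrow(annotS))
for (ty in names(specPrefS)) {
  hit <- grepl(specPrefS[[ty]], annotS[, "EntryName"])   # regular-expression search
  specType[hit] <- ty
}
annotS <- cbind(annotS, SpecType = specType)
annotS
```

Such a column can then be used to select, for example, only the lines of the main species as reference for normalization.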
This (lower-level) function performs the basic NA-imputation.
Note, at this point the information from argument gr is not used.
.imputeNA( dat, gr = NULL, impParam, exclNeg = TRUE, inclLowValMod = TRUE, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(matrix or data.frame) main data (may contain |
gr |
(character or factor) grouping of columns of |
impParam |
(numeric) 1st for mean; 2nd for sd; 3rd for seed |
exclNeg |
(logical) exclude negative |
inclLowValMod |
(logical) label on x-axis on plot |
silent |
(logical) suppress messages |
debug |
(logical) supplemental messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function returns a list with $data and $datImp
for more complex treatment see matrixNAneighbourImpute
dat1 <- matrix(11:22, ncol=4)
dat1[3:4] <- NA
.imputeNA(dat1, impParam=c(mean(dat1, na.rm=TRUE), 0.1))
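The principle of this kind of imputation (replacing NA by random draws from a normal distribution defined by mean, sd and an optional seed) can be sketched as follows. The helper imputeNaSketch is hypothetical and only loosely mirrors the arguments impParam and exclNeg; how .imputeNA treats negative draws internally is not documented here, so taking absolute values is an assumption of this sketch.

```r
# Hypothetical helper, not part of wrProteo
imputeNaSketch <- function(dat, impParam, exclNeg = TRUE) {
  naPos <- which(is.na(dat))
  if (length(naPos) > 0) {
    # 3rd element of impParam (if given) is used as seed; note this resets the global RNG
    if (length(impParam) > 2 && !is.na(impParam[3])) set.seed(impParam[3])
    draws <- rnorm(length(naPos), mean = impParam[1], sd = impParam[2])
    if (exclNeg) draws <- abs(draws)     # assumption: simple way to avoid negative values
    dat[naPos] <- draws
  }
  dat
}

dat1 <- matrix(11:22, ncol = 4)
dat1[3:4] <- NA
imp1 <- imputeNaSketch(dat1, impParam = c(mean(dat1, na.rm = TRUE), 0.1, 2021))
imp1
```

Only the positions that were NA are modified; all measured values are returned unchanged.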
This (low-level) function allows (generic) plotting of density distribution for quantitation import-functions
.plotQuantDistr( abund, quant, custLay = NULL, normalizeMeth = NULL, softNa = NULL, refLi = NULL, refLiIni = NULL, notLogAbund = NA, figMarg = c(3.5, 3.5, 3, 1), tit = NULL, las = NULL, cexAxis = 0.8, nameSer = NULL, cexNameSer = NULL, silent = FALSE, callFrom = NULL, debug = FALSE )
abund |
(matrix or data.frame) abundance data, will be plotted as distribution |
quant |
(matrix or data.frame) optional additional abundance data, to plot 2nd distribution, eg of normalized data |
custLay |
(matrix) describing sample-setup, typically produced by |
normalizeMeth |
(character, length=1) name of normalization method (will be displayed in title of figure) |
softNa |
(character, length=1) name of quantitation-software (typically 2-letter abbreviation, eg 'MQ') |
refLi |
(integer) to display number reference lines |
refLiIni |
(integer) to display initial number reference lines |
notLogAbund |
(logical) set to |
figMarg |
(numeric, length=4) custom figure margins (will be passed to |
tit |
(character) custom title |
las |
(integer) indicate orientation of text in axes |
cexAxis |
(numeric) size of numeric axis labels as cex-expansion factor (see also |
nameSer |
(character) custom label for data-sets or columns (length must match number of data-sets) |
cexNameSer |
(numeric) size of individual data-series labels as cex-expansion factor (see also |
silent |
(logical) suppress messages |
callFrom |
(character) allow easier tracking of messages produced |
debug |
(logical) display additional messages for debugging |
This function returns a logical value (if data were valid for plotting) and produces a density distribution figure (if data were found valid)
used in readProtDiscovererFile, readMaxQuantFile, readProlineFile, readFragpipeFile
set.seed(2018); datT8 <- matrix(round(rnorm(800) +3,1), nc=8, dimnames=list(paste(
  "li",1:100,sep=""), paste(rep(LETTERS[1:3],c(3,3,2)),letters[18:25],sep="")))
.plotQuantDistr(datT8, quant=NULL, refLi=NULL, tit="Synthetic Data Distribution")
Calculate molecular mass based on atomic composition
AAmass(massTy = "mono", inPept = TRUE, inclSpecAA = FALSE)
massTy |
(character) 'mono' or 'average' |
inPept |
(logical) remove H2O corresponding to water loss at peptide bond formation |
inclSpecAA |
(logical) include ornithine O & selenocysteine U |
This function returns a vector with masses for all amino-acids (argument 'massTy' to switch from mono-isotopic to average mass)
massDeFormula(c("12H12O","HO"," 2H 1 Se, 6C 2N","HSeCN"," ","e"))
AAmass()
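As a worked illustration of computing a mass from atomic composition, and of what inPept = TRUE means, the mono-isotopic mass of glycine (C2H5NO2) and of its residue inside a peptide can be derived in base R. The helper massFromComposition and the small table atomMono are hypothetical and hold standard mono-isotopic masses for four elements only; the package itself uses .atomicMasses() and massDeFormula().

```r
# Mono-isotopic atomic masses (Da) for the elements needed here
atomMono <- c(H = 1.00782503, C = 12, N = 14.0030740, O = 15.9949146)

# Hypothetical helper: 'comp' is a named vector of atom counts
massFromComposition <- function(comp, atomMass = atomMono) {
  sum(comp * atomMass[names(comp)])
}

glyFree    <- massFromComposition(c(C = 2, H = 5, N = 1, O = 2))   # free glycine
water      <- massFromComposition(c(H = 2, O = 1))
glyResidue <- glyFree - water        # residue mass after water loss at the peptide bond
round(c(free = glyFree, water = water, residue = glyResidue), 5)
```

The residue mass is the quantity that gets summed along a peptide chain, with one water added back for the free termini.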
This function calculates the AUC (area under the curve) from ROC data in a matrix of specificity and sensitivity values, as provided in the output from summarizeForROC.
AucROC( dat, useCol = c("spec", "sens"), returnIfInvalid = NA, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(matrix or data.frame) main input containing sensitivity and specificity data (from |
useCol |
(character or integer) column names to be used: 1st for specificity and 2nd for sensitivity count columns |
returnIfInvalid |
( |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
This function returns the AUC (area under the curve) as numeric value; see argument returnIfInvalid for the value returned with invalid input
preparing ROC data summarizeForROC, (re)plot the ROC figure plotROC;
note that numerous other packages also provide support for working with ROC-curves: eg rocPkgShort, ROCR, pROC or ROCit, etc.
set.seed(2019); test1 <- list(annot=cbind(Species=c(rep("b",35), letters[sample.int(n=3,
  size=150,replace=TRUE)])), BH=matrix(c(runif(35,0,0.01), runif(150)), ncol=1))
roc1 <- summarizeForROC(test1, spec=c("a","b","c"), annotCol="Species")
AucROC(roc1)
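For orientation, the area under a ROC curve can be approximated with the trapezoidal rule on the points (1 - specificity, sensitivity). The helper aucSketch below is hypothetical; that AucROC uses exactly this rule is an assumption, and the small matrix rocS is made-up data with the column names 'spec' and 'sens' from the default of useCol.

```r
# Hypothetical helper: trapezoidal AUC from specificity/sensitivity columns
aucSketch <- function(dat, useCol = c("spec", "sens")) {
  x <- 1 - dat[, useCol[1]]        # false positive rate
  y <- dat[, useCol[2]]            # true positive rate
  ord <- order(x, y)               # sort points along the x-axis
  x <- x[ord]; y <- y[ord]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

rocS <- cbind(spec = c(1, 0.9, 0.5, 0), sens = c(0, 0.6, 0.9, 1))
aucS <- aucSketch(rocS)
aucS
```

A value of 0.5 corresponds to the diagonal (random classification), values close to 1 to a near-perfect separation.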
This function allows manipulating sample-names (ie colnames of abundance data) in a batch-wise manner from data stored as multiple matrixes or data.frames of a list.
Import functions such as readMaxQuantFile() organize initial flat files into lists (of matrixes) of the different types of data.
Many times all column names in such lists carry long names including redundant information, like the overall experiment name or date, etc.
The aim of this function is to facilitate 'cleaning' the sample- (ie column-) names to obtain short and concise names.
Character terms to be removed (via argument rem) and/or replaced/substituted (via argument subst) should be given as they are; characters with special behaviour in grep (like '.') will be protected internally.
Note that the character substitution part will be done first, and the removal part (without character replacement) afterwards.
cleanListCoNames( dat, rem = NULL, subst = c("-", "_"), lstE = c("raw", "quant", "counts"), mathOper = NULL, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(list) main input |
rem |
(character) character string to be removed, may be named 'left' and 'right' for more specific exact pattern matching
(this part will be performed before character substitutions by |
subst |
(character of length=2, or matrix with 2 columns) pair(s) of character-strings for replacement (1st as search-item and 2nd as replacement); this part is performed after character-removal via |
lstE |
(character, length=1) names of list-elements where colnames should be cleaned |
mathOper |
(character, length=1) optional mathematical operation on numerical part of sample-names (eg |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function returns a list (equivalent to input dat)
dat1 <- matrix(1:12, ncol=4, dimnames=list(1:3, paste0("sample_R.",1:4)))
dat1 <- list(raw=dat1, quant=dat1, notes="other..")
cleanListCoNames(dat1, rem=c(left="sample_"), c(".","-"))
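The batch-wise renaming can be pictured with a few lines of base R. The helper cleanNamesSketch is hypothetical and much simpler than cleanListCoNames: it treats all terms literally (fixed = TRUE, which is one way to 'protect' characters like '.'), ignores the 'left'/'right' naming of rem, and applies removal before substitution, which is a choice of this sketch only.

```r
# Hypothetical helper, not part of wrProteo
cleanNamesSketch <- function(dat, rem = NULL, subst = c("-", "_"),
                             lstE = c("raw", "quant", "counts")) {
  for (el in intersect(lstE, names(dat))) {        # only treat list-elements that exist
    cn <- colnames(dat[[el]])
    if (length(rem) > 0) for (r in rem) cn <- gsub(r, "", cn, fixed = TRUE)
    cn <- gsub(subst[1], subst[2], cn, fixed = TRUE)
    colnames(dat[[el]]) <- cn
  }
  dat
}

datL <- matrix(1:12, ncol = 4, dimnames = list(1:3, paste0("sample_R.", 1:4)))
datL <- list(raw = datL, quant = datL, notes = "other..")
datClean <- cleanNamesSketch(datL, rem = "sample_", subst = c(".", "-"))
colnames(datClean$raw)
```

List-elements not named in lstE (here 'notes') are passed through untouched.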
In most omics data-analysis one needs to employ a certain number of filtering strategies to avoid carrying artifacts into the step of statistical testing.
combineMultFilterNAimput takes on one side the original data and on the other side NA-imputed data to create several different filters and to finally combine them.
A filter aiming to take away the least abundant values (using the imputed data) can be fine-tuned by the argument abundThr.
This step compares the means for each group and line; at least one group-mean has to be > the threshold (based on the hypothesis that if all conditions represent extremely low measures, their differential may not be determined with certainty).
In contrast, the filter addressing the number of missing values (NA) uses the original data; the arguments colTotNa, minSpeNo and minTotNo are used at this step. Basically, this step allows defining a minimum content of 'real' (ie non-NA) values for further considering the measurements as reliable.
This part uses internally presenceFilt for filtering elevated content of NA per line.
Finally, this function combines both filters (as matrix of FALSE and TRUE) on NA-imputed and original data and returns a vector of logical values indicating if the corresponding lines pass all filter criteria.
combineMultFilterNAimput( dat, imputed, grp, annDat = NULL, abundThr = NULL, colRazNa = NULL, colTotNa = NULL, minSpeNo = 1, minTotNo = 2, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(matrix or data.frame) main data (may contain |
imputed |
(character) same as 'dat' but with all |
grp |
(character or factor) define groups of replicates (in columns of 'dat') |
annDat |
(matrix or data.frame) annotation data (should match lines of 'dat') |
abundThr |
(numeric) optional threshold filter for minimum abundance |
colRazNa |
(character) if razor peptides are used: column name for razor peptide count |
colTotNa |
(character) column name for total peptide count |
minSpeNo |
(integer) minimum number of specific peptides for maintaining proteins |
minTotNo |
(integer) minimum total ie max razor number of peptides |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
This function returns a vector of logical values if corresponding line passes filter criteria
set.seed(2013)
datT6 <- matrix(round(rnorm(300)+3,1), ncol=6, dimnames=list(paste0("li",1:50), letters[19:24]))
datT6 <- datT6 +matrix(rep(1:nrow(datT6),ncol(datT6)), ncol=ncol(datT6))
datT6[6:7,c(1,3,6)] <- NA
datT6[which(datT6 < 11 & datT6 > 10.5)] <- NA
datT6[which(datT6 < 6 & datT6 > 5)] <- NA
datT6[which(datT6 < 4.6 & datT6 > 4)] <- NA
datT6b <- matrixNAneighbourImpute(datT6, gr=gl(2,3))
datT6c <- combineMultFilterNAimput(datT6, datT6b, grp=gl(2,3), abundThr=2)
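The logic of combining an abundance filter (on imputed data) with a filter on the number of real measurements (on original data) can be sketched in base R. All object names below are hypothetical, the 'imputation' is a crude constant replacement, and minObs stands in for the package's more detailed criteria (presenceFilt, peptide counts), so this shows only the principle.

```r
grpS <- gl(2, 3)                                    # two groups of three replicates
datS <- rbind(a = c(10, 11, NA, 12, 12, 13),
              b = c(NA, NA, NA, NA,  2, NA),
              c = c( 1, NA, 1.5, 1, 1.2, NA))
impS <- datS
impS[is.na(impS)] <- 1                              # crude stand-in for NA-imputation

abundThr <- 2                                       # minimum group-mean (on imputed data)
minObs   <- 2                                       # hypothetical: min non-NA values per group

# Filter 1: at least one group-mean above the threshold (imputed data)
grpMeans  <- sapply(levels(grpS), function(g) rowMeans(impS[, grpS == g, drop = FALSE]))
passAbund <- apply(grpMeans > abundThr, 1, any)

# Filter 2: at least one group with enough real (non-NA) values (original data)
nObs   <- sapply(levels(grpS), function(g) rowSums(!is.na(datS[, grpS == g, drop = FALSE])))
passNa <- apply(nObs >= minObs, 1, any)

keepS <- passAbund & passNa                         # lines passing all filters
keepS
```

Line 'c' has enough real values but stays below the abundance threshold, so it is removed just like line 'b'.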
This function calculates the molecular mass of one-letter code amino-acid sequences.
convAASeq2mass( x, massTy = "mono", seqName = TRUE, silent = FALSE, callFrom = NULL )
x |
(character) aminoacid sequence (single upper case letters for describing a peptide/protein) |
massTy |
(character) default 'mono' for mono-isotopic masses (alternative 'average') |
seqName |
(logical) optional (alternative) names for the content of 'x' (ie aa seq) as name (always if 'x' has no names) |
silent |
(logical) suppress messages |
callFrom |
(character) allow easier tracking of message(s) produced |
This function returns a vector with masses for all amino-acids (argument 'massTy' to switch from mono-isotopic to average mass)
massDeFormula, AAmass, convToNum
convAASeq2mass(c("PEPTIDE","fPROTEINES"))
pep1 <- c(aa="AAAA", de="DEFDEF")
convAASeq2mass(pep1, seqN=FALSE)
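The underlying arithmetic is a sum of residue masses plus one water for the free termini. The sketch below uses a hypothetical helper seqMassSketch with standard mono-isotopic residue masses for only the seven amino-acids occurring in the examples; sequences with other letters yield NA here, which is a simplification and not necessarily how convAASeq2mass behaves.

```r
# Mono-isotopic residue masses (Da), subset needed for this illustration
resMono <- c(A = 71.03711, D = 115.02694, E = 129.04259, F = 147.06841,
             I = 113.08406, P = 97.05276, T = 101.04768)
waterMono <- 18.01056

# Hypothetical helper, not part of wrProteo
seqMassSketch <- function(x, resMass = resMono) {
  sapply(x, function(s) {
    aa <- strsplit(s, "")[[1]]                        # split into single letters
    if (!all(aa %in% names(resMass))) return(NA_real_)  # unknown letter in this small table
    sum(resMass[aa]) + waterMono                      # residues plus one water
  })
}

massS <- seqMassSketch(c("PEPTIDE", "AAAA"))
round(massS, 3)
```

For 'PEPTIDE' this gives a neutral mono-isotopic mass of about 799.36 Da.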
This function orders columns in list of matrixes (or matrix) according to argument sampNames and also offers an option for changing names of columns.
It was (initially) designed to adjust/correct the order of samples after import using readMaxQuantFile(), readProteomeDiscovererFile() etc.
The input may also be MArrayLM-type object from package limma or from functions moderTestXgrp or moderTest2grp.
corColumnOrder( dat, sampNames, replNames = NULL, useListElem = c("quant", "raw", "counts"), annotElem = "sampleSetup", newNames = NULL, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(matrix, list or MArrayLM-object from limma) main input of which columns should get re-ordered, may be output from |
sampNames |
(character) column-names in desired order for output (its content must match colnames of |
replNames |
(character) option for replacing column-names by new/different colnames; should be vector of NEW column-names (in order as input from |
useListElem |
(character) in case |
annotElem |
(character) name of list-element of |
newNames |
deprecated, please use |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
This function returns an object of same class as input dat (ie matrix, list or MArrayLM-object from limma)
readMaxQuantFile, readProteomeDiscovererFile; moderTestXgrp or moderTest2grp
grp <- factor(rep(LETTERS[c(3,1,4)], c(2,3,3)))
dat1 <- matrix(1:15, ncol=5, dimnames=list(NULL,c("D","A","C","E","B")))
corColumnOrder(dat1, sampNames=LETTERS[1:5])
dat2 <- list(quant=dat1, raw=dat1)
dat2
corColumnOrder(dat2, sampNames=LETTERS[1:5])
corColumnOrder(dat2, sampNames=LETTERS[1:5], replNames=c("Dd","Aa","Cc","Ee","Bb"))
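For a plain matrix or a list of matrixes, the core of such a re-ordering is a call to match(). The following base R sketch (object names hypothetical) shows that step only; the annotation element and MArrayLM-objects handled by corColumnOrder are not covered.

```r
datM <- matrix(1:15, ncol = 5, dimnames = list(NULL, c("D", "A", "C", "E", "B")))
sampNamesS <- LETTERS[1:5]                       # desired column order

ordS <- match(sampNamesS, colnames(datM))        # position of each desired name
stopifnot(!anyNA(ordS))                          # all requested names must exist
datOrd <- datM[, ordS, drop = FALSE]
datOrd

# The same order applied to every matrix of a list
datLst <- list(quant = datM, raw = datM)
datLstOrd <- lapply(datLst, function(m) m[, match(sampNamesS, colnames(m)), drop = FALSE])
```

Using match() rather than sort() keeps the approach valid for any desired order, not only alphabetical.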
Compare in-silico digested proteomes for unique and shared peptides, counts per protein or as peptides
Compare in-silico digested proteomes for unique and shared peptides, counts per protein or as peptides. The in-silico digestion may be performed separately using the package cleaver. Note: input must be list (or multiple named lists) of proteins with their respective peptides (eg by in-silico digestion).
countNoOfCommonPeptides( ..., prefix = c("Hs", "Sc", "Ec"), sep = "_", silent = FALSE, debug = FALSE, callFrom = NULL )
... |
(list) multiple lists of (in-silico) digested proteins (typically protein ID as names) with their respective peptides (AA sequence), one entry for each species |
prefix |
(character) optional (species-) prefix for entries in '...', will be only considered if '...' has no names |
sep |
(character) concatenation symbol |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
This function returns a list with $byPep as list of logical matrixes for each peptide (as line) and unique/shared/etc for each species; $byProt as list of matrixes with count data per protein (as line) for each species; $tab with simple summary-type count data
readFasta2 and/or cleave-methods in package cleaver
## The example mimics a proteomics experiment where extracts from E coli and
## Saccharomyces cerevisiae were mixed, thus not all peptides may occur unique.
(mi2 = countNoOfCommonPeptides(Ec=list(E1=letters[1:4],E2=letters[c(3:7)],
  E3=letters[c(4,8,13)],E4=letters[9]),Sc=list(S1=letters[c(2:3,6)],
  S2=letters[10:13],S3=letters[c(5,6,11)],S4=letters[c(11)],S5="n")))
## a .. uni E, b .. inteR, c .. inteR(+intra E), d .. intra E (no4), e .. inteR,
## f .. inteR +intra E (no6), g .. uni E, h .. uni E (no 8), i .. uni E,
## j .. uni S (no10), k .. intra S (no11), l .. uni S (no12), m .. inteR (no13)
lapply(mi2$byProt,head)
mi2$tab
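The distinction between peptides shared across species and peptides specific to one species can be reproduced with set operations in base R. The sketch reuses the toy input of the example above (letters standing in for peptide sequences); the object names are hypothetical and the output is far less detailed than the $byPep/$byProt/$tab structure returned by countNoOfCommonPeptides.

```r
EcS <- list(E1 = letters[1:4], E2 = letters[3:7], E3 = letters[c(4, 8, 13)], E4 = letters[9])
ScS <- list(S1 = letters[c(2:3, 6)], S2 = letters[10:13], S3 = letters[c(5, 6, 11)],
            S4 = letters[11], S5 = "n")

pepEc <- unique(unlist(EcS))                    # all distinct peptides per species
pepSc <- unique(unlist(ScS))

sharedPep <- sort(intersect(pepEc, pepSc))      # found in both species ('inteR')
onlyEc    <- sort(setdiff(pepEc, pepSc))        # specific to E coli

# Number of inter-species shared peptides per E coli protein
sharedPerProtEc <- sapply(EcS, function(p) sum(p %in% sharedPep))
sharedPep; onlyEc; sharedPerProtEc
```

The shared set (b, c, e, f, m) agrees with the peptides marked 'inteR' in the comments of the example.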
This function allows exporting objects created from wrProteo to the format of Wombat-P.
exportAsWombatP( wrProtObj, path = ".", combineFractions = "mean", silent = FALSE, debug = FALSE, callFrom = NULL )
wrProtObj |
(list produced by any import-function from wrProteo) object which will be exported as Wombat-P format |
path |
(character) the location where the data should be exported to |
combineFractions |
( |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
This function creates a set of files (README.md, test_params.yml), plus a sub-directory containing file(s) (stand_prot_quant_method.csv); finally the function returns NULL.
readMaxQuantFile, readProteomeDiscovererFile; moderTestXgrp or moderTest2grp
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "proteinGroupsMaxQuant1.txt.gz"
specPr <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spike="HUMAN_UPS")
dataMQ <- readMaxQuantFile(path1, file=fiNa, specPref=specPr, tit="tiny MaxQuant")
exportAsWombatP(dataMQ, path=tempdir())
Sample/experimental annotation meta-data from MaxQuant that was previously imported can be formatted in sdrf-style and exported using this function to write a draft-sdrf-file. Please note that this information will not be _complete_ with respect to all information used in data-bases like Pride. Sdrf-files provide additional meta-information about samples and MS-runs in a standardized format; they may also be part of submissions to Pride.
exportSdrfDraft( lst, fileName = "sdrfDraft.tsv", correctFileExtension = TRUE, silent = FALSE, debug = FALSE, callFrom = NULL )
lst |
(list) object created by import-function (MaxQuant) |
fileName |
(character) file-name (and path) to be used when exporting |
correctFileExtension |
(logical) if |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
Gathering as much information as possible about samples and MS-runs requires that the additional files created by the quantitation software (like MaxQuant, imported using readMaxQuantFile) are present and were imported when calling the import-function (eg using the argument _suplAnnotFile=TRUE_).
Please note that this functionality was designed for the case where no (external) sdrf-file is available.
Thus, when data was imported including an external sdrf (using the _sdrf=_ argument), exporting incomplete annotation-data from MaxQuant-produced files does not make any sense and therefore won't be possible.
After exporting the draft sdrf the user is advised to check and complete the information in the resulting file.
Unfortunately, not all information present in a standard sdrf-file (like on Pride) can be gathered automatically, but key columns are already present and thus may facilitate completing.
Please note that the file-format has been defined as .tsv, thus columns/fields should be separated by tabs.
During manual editing and completion, some editing- or spreadsheet-software may change the file-extension to .tsv.txt; in this case the final files should be renamed as .tsv to remain compatible with Pride.
At this point only the import of data from MaxQuant via readMaxQuantFile has been developed to extract information for creating a draft-sdrf.
Other data/file-import functions may be further developed to gather as much equivalent information as possible in the future.
This function writes an Sdrf draft to file
This function may be used after reading/importing data by readMaxQuantFile in absence of sdrf
path1 <- system.file("extdata", package="wrProteo")
fiNaMQ <- "proteinGroups.txt.gz"
dataMQ <- readMaxQuantFile(path1, file=fiNaMQ, refLi="mainSpe", sdrf=FALSE, suplAnnotFile=TRUE)
## Here we'll write simply in the current temporary directory of this R-session
exportSdrfDraft(dataMQ, file.path(tempdir(),"testSdrf.tsv"))
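Writing a tab-separated table and enforcing the .tsv extension (including the '.tsv.txt' case mentioned in the details) can be sketched in base R. The helper writeTsvSketch and the two-column table are hypothetical; in particular, that the argument correctFileExtension of exportSdrfDraft works the same way is an assumption.

```r
# Hypothetical helper, not part of wrProteo
writeTsvSketch <- function(tab, fileName, correctFileExtension = TRUE) {
  if (correctFileExtension && !grepl("\\.tsv$", fileName)) {
    # drop a trailing extension (also '.tsv.txt') and append '.tsv'
    fileName <- paste0(sub("(\\.tsv)?\\.[[:alnum:]]+$", "", fileName), ".tsv")
  }
  write.table(tab, file = fileName, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(fileName)
}

# Made-up mini-table with two sdrf-style column names
draftS <- data.frame("source name" = c("sample 1", "sample 2"),
                     "comment[data file]" = c("run1.raw", "run2.raw"),
                     check.names = FALSE)
outFile <- writeTsvSketch(draftS, file.path(tempdir(), "sdrfSketch.tsv.txt"))
basename(outFile)
```

Reading the file back with read.delim(outFile, check.names=FALSE) keeps the column names with spaces and brackets intact.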
This function allows convenient access to results produced using the functions moderTest2grp or moderTestXgrp.
The user can define the threshold and which type of multiple testing correction should be used (as long as the multiple testing correction method was actually performed as part of testing).
extractTestingResults( stat, compNo = 1, statTy = "BH", thrsh = 0.05, FCthrs = 1.5, annotCol = c("Accession", "EntryName", "GeneName"), nSign = 6, addTy = c("allMeans"), filename = NULL, fileTy = "csvUS", silent = FALSE, debug = FALSE, callFrom = NULL )
stat |
('MArrayLM'-object or list) Designed for the output from |
compNo |
(integer) the comparison number/index to be used |
statTy |
(character) the multiple-testing correction type to be considered when looking for significant changes with threshold |
thrsh |
(numeric) the threshold to be applied on |
FCthrs |
(numeric) Fold-Change threshold given as Fold-change and NOT log2(FC), default at 1.5 (for filtering at M-value =0.585) |
annotCol |
(character) column-names from the annotation to be included |
nSign |
(integer) number of significant digits when returning results |
addTy |
(character) additional groups to add (so far only "allMeans" available) in addition to the means used in the pairwise comparison |
filename |
(character) optional (path and) file-name for exporting results to csv-file |
fileTy |
(character) file-type to be used with argument |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function returns a limma-type MA-object (which can be handled just like a list).
See also: testRobustToNAimputation, moderTestXgrp or moderTest2grp
grp <- factor(rep(LETTERS[c(3,1,4)],c(2,3,3)))
set.seed(2017); t8 <- matrix(round(rnorm(208*8,10,0.4),2), ncol=8,
  dimnames=list(paste(letters[],rep(1:8,each=26),sep=""), paste(grp,c(1:2,1:3,1:3),sep="")))
t8[3:6,1:2] <- t8[3:6,1:2] +3                    # augment lines 3:6 (c-f)
t8[5:8,c(1:2,6:8)] <- t8[5:8,c(1:2,6:8)] -1.5    # lower lines
t8[6:7,3:5] <- t8[6:7,3:5] +2.2                  # augment lines
## expect to find C/A in c,d,g, (h)
## expect to find C/D in c,d,e,f
## expect to find A/D in f,g,(h)
library(wrMisc)   # for testing we'll use this package
test8 <- moderTestXgrp(t8, grp)
extractTestingResults(test8)
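The interplay of the arguments statTy, thrsh and FCthrs can be illustrated in base R. This is only a minimal sketch and not the code of extractTestingResults: the vectors pRaw and logFC are made-up values, and it is assumed that the fold-change threshold is converted to the log2 scale and compared to the absolute M-value.

```r
## Hypothetical raw p-values and log2 fold-changes (M-values) for 6 proteins
pRaw  <- c(0.0004, 0.002, 0.03, 0.2, 0.0001, 0.6)
logFC <- c(1.2, -0.9, 0.3, 2.0, 0.4, -0.1)

thrsh  <- 0.05     # threshold applied to the corrected p-values
FCthrs <- 1.5      # fold-change threshold on linear scale (NOT log2)

## Benjamini-Hochberg correction (the type named "BH")
pBH <- p.adjust(pRaw, method="BH")

## A linear fold-change of 1.5 corresponds to an M-value of log2(1.5), ie approx 0.585
MThrs <- log2(FCthrs)

## Assumption of this sketch: both criteria must be met
keep <- pBH <= thrsh & abs(logFC) >= MThrs
which(keep)
```

Giving the threshold on linear scale and converting it once avoids mixing linear fold-changes with log2 M-values when filtering.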
extrSpeciesAnnot identifies species-related annotation (as suffix to identifiers) for data combining multiple species and returns alternative (short) names.
This function also suppresses extra leading or trailing space or punctuation characters.
In case multiple tags are found, the last tag is reported and an alert message may be displayed.
extrSpeciesAnnot(
  annot,
  spec = c("_CONT", "_HUMAN", "_YEAST", "_ECOLI"),
  shortNa = c("cont", "H", "S", "E"),
  silent = FALSE, debug = FALSE, callFrom = NULL
)
annot | (character) vector with initial annotation |
spec | (character) the tags to be identified |
shortNa | (character) the final abbreviation used, order and length must fit to argument |
silent | (logical) suppress messages |
debug | (logical) display additional messages for debugging |
callFrom | (character) allow easier tracking of messages produced |
This function returns a character vector with single (last of multiple) term if found in argument annot
spec <- c("keratin_CONT","AB_HUMAN","CD_YEAST","EF_G_HUMAN","HI_HUMAN_ECOLI","_YEAST_012")
extrSpeciesAnnot(spec)
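The rule "the last tag is reported" can be sketched in base R as follows. The helper extrLastTag is hypothetical and not part of wrProteo; it matches tags anywhere in the text (not only as suffix) and does not trim punctuation, so its results may differ from those of extrSpeciesAnnot.

```r
## Hypothetical helper: report the short name of the last matching tag per entry
extrLastTag <- function(annot, spec, shortNa) {
  stopifnot(length(spec) == length(shortNa))
  out <- rep(NA_character_, length(annot))
  lastPos <- rep(0L, length(annot))
  for (i in seq_along(spec)) {
    ## position of the last occurrence of tag i in each entry (-1 if absent)
    pos <- vapply(gregexpr(spec[i], annot, fixed=TRUE), function(m) max(m), integer(1))
    upd <- pos > lastPos            # tag found further to the right than previous ones
    out[upd] <- shortNa[i]
    lastPos[upd] <- pos[upd]
  }
  out
}

annot <- c("keratin_CONT","AB_HUMAN","CD_YEAST","EF_G_HUMAN","HI_HUMAN_ECOLI","_YEAST_012")
res <- extrLastTag(annot, spec=c("_CONT","_HUMAN","_YEAST","_ECOLI"), shortNa=c("cont","H","S","E"))
res
```

Entries without any of the tags remain NA in this sketch.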
NOTE: This function is deprecated, please use foldChangeArrow instead!
This function was made for adding an arrow indicating a fold-change to MA- or Volcano-plots.
When comparing multiple concentrations of standards in benchmark-tests it may be useful to indicate the expected ratio in a pair-wise comparison.
In case of main input as list or MArrayLM-object (as generated from limma), the column-names of multiple pairwise comparisons can be used for extracting a numeric content (supposed as concentrations in sample-names) which will be used to determine the expected ratio used for plotting.
Optionally the ratio used for plotting can be returned as numeric value.
foldChangeArrow2(
  FC, useComp = 1, isLin = TRUE, asX = TRUE, col = 1,
  arr = c(0.005, 0.15), lwd = NULL,
  addText = c(line = -0.9, cex = 0.7, txt = "expected", loc = "toright"),
  returnRatio = FALSE, silent = FALSE, debug = FALSE, callFrom = NULL
)
FC | (numeric, list or MArrayLM-object) main information for drawing arrow : either numeric value for fold-change/log2-ratio, or object to search for colnames of statistical testing for extracting numeric part |
useComp | (integer) only used in case FC is list or MArrayLM-object and has multiple pairwise-comparisons |
isLin | (logical) indicate if |
asX | (logical) indicate if arrow should be on x-axis |
col | (integer or character) custom color |
arr | (numeric, length=2) start- and end-points of arrow (as relative to entire plot) |
lwd | (numeric) line-width of arrow |
addText | (logical or named vector) indicate if text explaining arrow should be displayed, use |
returnRatio | (logical) return ratio |
silent | (logical) suppress messages |
debug | (logical) display additional messages for debugging |
callFrom | (character) allow easier tracking of message(s) produced |
The argument addText also allows specifying a fixed position when using addText=c(loc="bottomleft"); also bottomright, topleft, topright, toleft and toright may be used.
In this case the elements side and adjust will be redefined to accommodate the text in the corner specified.
Ultimately this function will be integrated into the package wrGraph.
This function plots an arrow only (and explicative text); if returnRatio=TRUE it also returns the numeric value of the extracted ratio.
New version: foldChangeArrow; used with MAplotW, VolcanoPlotW
plot(rnorm(20,1.5,0.1),1:20)
#deprecated# foldChangeArrow2(FC=1.5)
This function allows combining up to 3 separate data-sets previously imported using wrProteo.
fuseProteomicsProjects(
  x, y, z = NULL, columnNa = "Accession", NA.rm = TRUE,
  listNa = c(quant = "quant", annot = "annot"), all = FALSE,
  textModif = NULL, shortNa = NULL, retProtLst = FALSE,
  silent = FALSE, debug = FALSE, callFrom = NULL
)
x | (list) First Proteomics data-set |
y | (list) Second Proteomics data-set |
z | (list) optional third Proteomics data-set |
columnNa | (character) column names from annotation |
NA.rm | (logical) remove |
listNa | (character) names of key list-elements from |
all | (logical) indicate whether a union or an intersect merge should be performed between x, y and z |
textModif | (character) Additional modifications to the identifiers from argument |
shortNa | (character) for appending to output-colnames |
retProtLst | (logical) return list-object similar to input, otherwise a matrix of fused/aligned quantitation data |
silent | (logical) suppress messages |
debug | (logical) additional messages for debugging |
callFrom | (character) allow easier tracking of messages produced |
Some quantification software may give some identifiers multiple times, ie as multiple lines (eg for different modifications or charge states, etc).
In this case this function first tries to summarize all lines with identical identifiers (using the function combineRedundLinesInList, which uses the median value by default).
Thus, it is very important to know your data and to understand when lines that appear with the same identifiers should/may be fused/summarized without doing damage to the later biological interpretation! The user may specify for each dataset the column of the protein/peptide-annotation to use via the argument columnNa.
Then, this content will be matched as identical match, so special care should be taken when combining data from different software!
Please note that (at this point) the data from different series/objects will be joined as they are, ie without any additional normalization. It is up to the user to inspect the resulting data and to decide if and which type of normalization may be suitable!
Please do NOT try combining protein and peptide quantification data.
This function returns a list with the same number of list-elements as $x, ie typically this contains: $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, optionally $counts an array with number of peptides, $quantNotes or $notes
path1 <- system.file("extdata", package="wrProteo")
dataMQ <- readMaxQuantFile(path1, specPref=NULL, normalizeMeth="median")
MCproFi1 <- "tinyMC.RData"
dataMC <- readMassChroQFile(path1, file=MCproFi1, plotGraph=FALSE)
dataFused <- fuseProteomicsProjects(dataMQ, dataMC)
dim(dataMQ$quant)
dim(dataMC$quant)
dim(dataFused$quant)
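The two steps described for this function (summarizing redundant identifiers by the median, then matching identifiers exactly) can be sketched in base R. The data.frames q1 and q2 and the helper collapseMedian are hypothetical; the sketch uses aggregate and merge and is not the implementation used by fuseProteomicsProjects or combineRedundLinesInList.

```r
## Two hypothetical quantitation tables sharing an 'Accession' column
q1 <- data.frame(Accession=c("P1","P2","P2","P3"), s1=c(10, 12, 14, 9), s2=c(11, 13, 15, 8))
q2 <- data.frame(Accession=c("P2","P3","P4"), t1=c(20, 21, 22))

## Step 1: collapse lines with identical identifiers using the median
collapseMedian <- function(df, idCol="Accession") {
  aggregate(df[, setdiff(colnames(df), idCol), drop=FALSE], by=df[idCol], FUN=median, na.rm=TRUE)
}
q1u <- collapseMedian(q1)
q2u <- collapseMedian(q2)

## Step 2: exact matching of identifiers; no normalization is applied here
fusedInter <- merge(q1u, q2u, by="Accession", all=FALSE)   # intersect
fusedUnion <- merge(q1u, q2u, by="Accession", all=TRUE)    # union, NA where absent
dim(fusedInter)
dim(fusedUnion)
```

In the union, identifiers seen in only one data-set carry NA in the columns of the other one, which is one reason to inspect the fused data before any further normalization.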
UPS1 (see https://www.sigmaaldrich.com/FR/en/product/sigma/ups1) and UPS2 are commercial products consisting of a mix of 48 human (purified) proteins. They are available from Sigma-Aldrich and frequently used as standards in spike-in experiments. This function allows accessing their protein accession numbers and associated names on UniProt.
getUPS1acc(updated = TRUE)
updated | (logical) return updated accession number (of UBB) |
Please note that the UniProt accession 'P62988' for 'UBIQ_HUMAN' (as originally cited by Sigma-Aldrich) was withdrawn in 2010 and replaced by UniProt with the accessions 'P0CG47', 'P0CG48', 'P62979', and 'P62987'.
This initial accession is available via getUPS1acc()$acOld, now getUPS1acc()$ac contains 'P0CG47'.
This function returns a data.frame with accession-numbers as stated by the supplier ($acFull), trimmed accession-numbers, ie without version numbers ($ac), and associated (UniProt) entry-names ($EntryName) from UniProt as well as the species designation for the collection of 48 human UPS1 or UPS2 proteins.
head(getUPS1acc())
This function inspects its main argument to convert a species indication to the scientific name or to return all protein-accession numbers for a name of a standard collection like UPS1.
inspectSpeciesIndic(x, silent = FALSE, debug = FALSE, callFrom = NULL)
x | (character) species indication or name of collection of proteins (so far only UPS1 & UPS2) |
silent | (logical) suppress messages |
debug | (logical) display additional messages for debugging |
callFrom | (character) allows easier tracking of messages produced |
This function returns a character vector
inspectSpeciesIndic("Human")
inspectSpeciesIndic("UPS1")
This function extracts all replicate-values where at least one of the replicates is NA and sorts by number of NAs per group.
A list with all NA-neighbours organized by the number of NAs gets returned.
isolNAneighb(mat, gr, silent = FALSE, debug = FALSE, callFrom = NULL)
mat | (matrix or data.frame) main data (may contain |
gr | (character or factor) grouping of columns of 'mat', replicate association |
silent | (logical) suppress messages |
debug | (logical) display additional messages for debugging |
callFrom | (character) allow easier tracking of messages produced |
This function returns a list with NA-neighbours sorted by number of NAs in replicate group
This function gets used by matrixNAneighbourImpute and testRobustToNAimputation; estimation of mode: stableMode; detection of NAs: na.fail
mat1 <- c(22.2, 22.5, 22.2, 22.2, 21.5, 22.0, 22.1, 21.7, 21.5, 22, 22.2, 22.7,
  NA, NA, NA, NA, NA, NA, NA, 21.2, NA, NA, NA, NA,
  NA, 22.6, 23.2, 23.2, 22.4, 22.8, 22.8, NA, 23.3, 23.2, NA, 23.7,
  NA, 23.0, 23.1, 23.0, 23.2, 23.2, NA, 23.3, NA, NA, 23.3, 23.8)
mat1 <- matrix(mat1, ncol=12, byrow=TRUE)
gr4 <- gl(3, 4)
isolNAneighb(mat1, gr4)
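The idea of collecting NA-neighbours by the number of NAs per replicate group can be sketched in base R. The helper naNeighbours and the naming of the list elements (n1, n2, ...) are assumptions made for this illustration; the structure returned by isolNAneighb may be organized differently.

```r
## Hypothetical helper: gather non-NA values from replicate groups containing NAs,
## organized by the number of NAs in the respective group
naNeighbours <- function(mat, gr) {
  gr <- as.factor(gr)
  out <- list()
  for (g in levels(gr)) {
    sub <- mat[, gr == g, drop=FALSE]
    nNA <- rowSums(is.na(sub))
    ## keep lines with at least one NA and at least one measured value
    for (k in sort(unique(nNA[nNA > 0 & nNA < ncol(sub)]))) {
      vals <- sub[nNA == k, , drop=FALSE]
      key <- paste0("n", k)
      out[[key]] <- c(out[[key]], vals[!is.na(vals)])
    }
  }
  out
}

mat <- rbind(c(10, NA, 11, 20, 21, 22),
             c(NA, NA, 12, 23, NA, 24),
             c(13, 14, 15, NA, NA, NA))
res <- naNeighbours(mat, gr=gl(2, 3))
res
```

Lines where a whole replicate group is NA (third line, second group) contribute no NA-neighbours, since no measured value is available next to the NAs.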
Calculate molecular mass based on atomic composition
massDeFormula(comp, massTy = "mono", rmEmpty = FALSE, silent = FALSE, callFrom = NULL)
comp | (character) atomic composition |
massTy | (character) 'mono' or 'average' |
rmEmpty | (logical) suppress empty entries |
silent | (logical) suppress messages |
callFrom | (character) allow easier tracking of messages produced |
This function returns a numeric vector with mass
massDeFormula(c("12H12O","HO"," 2H 1 Se, 6C 2N","HSeCN"," ","e"))
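The underlying calculation is a sum of atomic masses weighted by the number of atoms. The following base R sketch assumes the composition is already given as named counts (the parsing of text like "12H12O" is not shown); the table monoMass and the helper massFromCounts are hypothetical and cover only four elements with rounded monoisotopic masses.

```r
## Monoisotopic masses (rounded) for a few elements
monoMass <- c(H=1.00782503, C=12, N=14.0030740, O=15.9949146)

## Hypothetical helper: mass as sum of (number of atoms) x (atomic mass)
massFromCounts <- function(counts, massTab=monoMass) {
  stopifnot(all(names(counts) %in% names(massTab)))
  sum(counts * massTab[names(counts)])
}

mWater   <- massFromCounts(c(H=2, O=1))               # H2O
mGlycine <- massFromCounts(c(C=2, H=5, N=1, O=2))     # C2H5NO2
c(water=mWater, glycine=mGlycine)
```

For average instead of monoisotopic masses only the table of atomic masses would need to be exchanged.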
matrixNAinspect makes histograms of the full data and shows the sub-population of NA-neighbour values.
The aim of this function is to investigate the nature of NA values in a matrix (of experimental measures) where replicate measurements are available.
If a given element was measured twice, and one of these measurements gave an NA while the other one gave a (finite) numeric value, the non-NA-value is considered an NA-neighbour.
The subpopulation of these NA-neighbour values will then be highlighted in the resulting histogram.
In a number of experimental settings some actual measurements may not meet an arbitrarily defined baseline (as 'zero') or may be too low to be distinguishable from noise, so that the associated measures were initially recorded as NA. This may happen in several types of measurements in proteomics and transcriptomics.
So this function allows collecting all NA-neighbour values and comparing them to the global distribution of the data, to investigate whether NA-neighbours are typically very low values.
In case of data with multiple replicates, NA-neighbour values may be distinguished for the case of 2 NA per group/replicate-set.
The resulting plots are typically used to decide if and how NA values may get replaced by imputed random values or whether measures containing NA-values should rather be omitted.
Of course, such decisions do have a strong impact on further steps of data-analysis and should be performed with care.
matrixNAinspect(
  dat, gr = NULL, retnNA = TRUE, xLab = NULL, tit = NULL, xLim = NULL,
  silent = FALSE, debug = FALSE, callFrom = NULL
)
dat | (matrix or data.frame) main numeric data |
gr | (character or factor) grouping of columns of dat indicating who is a replicate of whom (ie the length of 'gr' must be equivalent to the number of columns in 'dat') |
retnNA | (logical) report number of NAs in graphic |
xLab | (character) custom x-label |
tit | (character) custom title |
xLim | (numerical,length=2) custom x-axis limits |
silent | (logical) suppress messages |
debug | (logical) additional messages for debugging |
callFrom | (character) allow easier tracking of messages produced |
This function produces a graphic (to the current graphical device)
set.seed(2013)
datT6 <- matrix(round(rnorm(300)+3,1), ncol=6, dimnames=list(paste("li",1:50,sep=""), letters[19:24]))
datT6 <- datT6 +matrix(rep(1:nrow(datT6),ncol(datT6)), ncol=ncol(datT6))
datT6[6:7,c(1,3,6)] <- NA
datT6[which(datT6 < 11 & datT6 > 10.5)] <- NA
datT6[which(datT6 < 6 & datT6 > 5)] <- NA
datT6[which(datT6 < 4.6 & datT6 > 4)] <- NA
matrixNAinspect(datT6, gr=gl(2,3))
It is assumed that NA-values appear in data when quantitation values are very low (as this appears eg in quantitative shotgun proteomics).
Here, the concept of (technical) replicates is used to investigate what kind of values appear in the other replicates next to NA-values for the same line/protein.
Groups of replicate samples are defined via argument gr, which describes the columns of dat.
Then, they are inspected for each line to gather NA-neighbour values (ie those values where NAs and regular measures are observed at the same time).
Eg, let's consider a line that contains a set of 4 replicates for a given group. Now, if 2 of them are NA-values, the remaining 2 non-NA-values will be considered as NA-neighbours.
Ultimately, the aim is to replace all NA-values based on values from a normal distribution resembling their respective NA-neighbours.
matrixNAneighbourImpute(
  dat, gr, imputMethod = "mode2", retnNA = TRUE, avSd = c(0.15, 0.5),
  avSdH = NULL, NAneigLst = NULL, plotHist = c("hist", "mode"),
  xLab = NULL, xLim = NULL, yLab = NULL, yLim = NULL, tit = NULL,
  figImputDetail = TRUE, seedNo = NULL, silent = FALSE, callFrom = NULL,
  debug = FALSE
)
dat | (matrix or data.frame) main data (may contain |
gr | (character or factor) grouping of columns of 'dat', replicate association |
imputMethod | (character) choose the imputation method (may be 'mode2'(default), 'mode1', 'datQuant', 'modeAdopt' or 'informed') |
retnNA | (logical) decide (if = |
avSd | (numerical,length=2) population characteristics 'high' (mean and sd) for >1 |
avSdH | deprecated, please use |
NAneigLst | (list) option for repeated rounds of imputations: list of |
plotHist | (character or logical) decide if supplemental figure with histogram should be drawn, the details 'Hist','quant' (display quantile of original data), 'mode' (display mode of original data) can be chosen explicitly |
xLab | (character) label on x-axis on plot |
xLim | (numeric, length=2) custom x-axis limits |
yLab | (character) label on y-axis on plot |
yLim | (numeric, length=2) custom y-axis limits |
tit | (character) title on plot |
figImputDetail | (logical) display details about data (number of NAs) and imputation in graph (min number of NA-neighbours per protein and group, quantile to model, mean and sd of imputed) |
seedNo | (integer) seed-value for normal random values |
silent | (logical) suppress messages |
callFrom | (character) allow easier tracking of messages produced |
debug | (logical) supplemental messages for debugging |
By default a histogram gets plotted showing the initial, imputed and final distribution to check the global hypothesis that NA-values arose from very low measurements and to appreciate the impact of the imputed values on the overall final distribution.
There are a number of experimental settings where low measurements may be reported as NA.
Sometimes an arbitrarily defined baseline (as 'zero') may cause values found below it to be reported as NA or as 0 (in case of MaxQuant).
In quantitative proteomics (DDA-mode) the presence of numerous high-abundance peptides means that a number of less intense MS-peaks don't get identified properly and will then be reported as NA in the respective samples, while the same peptides may be correctly identified and quantified in other (replicate) samples.
So, if a given protein/peptide gets properly quantified in some replicate samples but reported as NA in other replicate samples, one may speculate that values similar to those of the successful quantifications may have occurred.
Thus, imputation of NA-values may be done on the basis of NA-neighbours.
When extracting NA-neighbours, a slightly more focussed approach gets checked, too, the 2-NA-neighbours: In case a set of replicates for a given protein contains at least 2 non-NA-values (instead of just one) it will be considered as a (min) 2-NA-neighbour as well as regular NA-neighbour.
If >300 of these (min) 2-NA-neighbours get found, they will be used instead of the regular NA-neighbours.
For creating a collection of normal random values one may use directly the mode of the NA-neighbours (or 2-NA-neighbours, if >300 such values available).
To do so, the first value of argument avSd must be set to NA. Otherwise, the first value of avSd will be used as quantile of all data to define the mean for the imputed data (ie as quantile(dat, avSd[1], na.rm=TRUE)). The sd for generating normal random values will be taken from the sd of all NA-neighbours (or 2-NA-neighbours) multiplied by the second value in argument avSd (or avSd, if >300 2-NA-neighbours), since the sd of the NA-neighbours is usually quite high.
In extremely rare cases it may happen that no NA-neighbours are found (ie if NAs occur, all replicates are NA).
Then, this function replaces NA-values based on the normal random values obtained as described above.
This function returns a list with $data .. matrix of data where NA are replaced by imputed values, $nNA .. number of NA by group, $randParam .. parameters used for making random data
This function gets used by testRobustToNAimputation; estimation of mode: stableMode; detection of NAs: na.fail
set.seed(2013)
datT6 <- matrix(round(rnorm(300)+3,1), ncol=6, dimnames=list(paste("li",1:50,sep=""), letters[19:24]))
datT6 <- datT6 +matrix(rep(1:nrow(datT6), ncol(datT6)), ncol=ncol(datT6))
datT6[6:7, c(1,3,6)] <- NA
datT6[which(datT6 < 11 & datT6 > 10.5)] <- NA
datT6[which(datT6 < 6 & datT6 > 5)] <- NA
datT6[which(datT6 < 4.6 & datT6 > 4)] <- NA
datT6b <- matrixNAneighbourImpute(datT6, gr=gl(2,3))
head(datT6b$data)
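The quantile-based rule from the details (mean from quantile(dat, avSd[1], na.rm=TRUE), sd from the NA-neighbours multiplied by avSd[2]) can be sketched in base R. This is a simplified illustration on made-up data and not the code of matrixNAneighbourImpute: it ignores the mode-based methods, the 2-NA-neighbours and the >300 rule.

```r
set.seed(2013)
dat <- matrix(rnorm(60, mean=10, sd=1), ncol=6)   # hypothetical data, 2 groups of 3 replicates
dat[cbind(1:5, 1:5)] <- NA                        # introduce a few NAs
gr <- gl(2, 3)
avSd <- c(0.15, 0.5)

## NA-neighbours: measured values from replicate groups that also contain NAs
neigh <- unlist(lapply(levels(gr), function(g) {
  sub <- dat[, gr == g, drop=FALSE]
  nNA <- rowSums(is.na(sub))
  v <- sub[nNA > 0 & nNA < ncol(sub), , drop=FALSE]
  v[!is.na(v)]
}))

impMean <- quantile(dat, avSd[1], na.rm=TRUE)     # mean for imputed values
impSd   <- sd(neigh) * avSd[2]                    # reduced sd of the NA-neighbours

isNA <- is.na(dat)
datImp <- dat
datImp[isNA] <- rnorm(sum(isNA), mean=impMean, sd=impSd)
sum(is.na(datImp))
```

Measured values remain unchanged, only the NA positions receive random values drawn from the low end of the overall distribution.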
plotROC plots ROC curves based on results from summarizeForROC.
This function plots only, it does not return any data. It allows printing multiple ROC curves from different studies simultaneously; it is also compatible with data from a 3-species mix as in proteomics benchmarks.
Input can be prepared using moderTest2grp followed by summarizeForROC.
plotROC(
  dat, ..., useColumn = 2:3, methNames = NULL, col = NULL, pch = 1,
  bg = NULL, tit = NULL, xlim = NULL, ylim = NULL, point05 = 0.05,
  pointSi = 0.85, nByMeth = NULL, speciesOrder = NULL, txtLoc = NULL,
  legCex = 0.72, las = 1, addSuplT = TRUE, silent = FALSE, debug = FALSE,
  callFrom = NULL
)
dat | (matrix) from testing (eg |
... | optional additional data-sets to include as separate ROC-curves to same plot (must be of same type of format as 'dat') |
useColumn | (integer or character, length=2) columns from |
methNames | (character) names of methods (data-sets) to be displayed |
col | (character) custom colors for lines and text (choose one color for each different data-set) |
pch | (integer) type of symbol to be used (see also |
bg | (character) background color in plot (see also |
tit | (character) custom title |
xlim | (numeric, length=2) custom x-axis limits |
ylim | (numeric, length=2) custom y-axis limits |
point05 | (numeric) specific point to highlight in plot (typically at alpha=0.05) |
pointSi | (numeric) size of points (as expansion factor |
nByMeth | (integer) value of n to display |
speciesOrder | (integer) custom order of species in legend |
txtLoc | (numeric, length=3) location for text (x, y location and proportional factor for line-offset, default is c(0.4,0.3,0.04)) |
legCex | (numeric) cex expansion factor for legend (see also |
las | (numeric) factor for text-orientation (see also |
addSuplT | (logical) add text with information about precision, accuracy and FDR |
silent | (logical) suppress messages |
debug | (logical) display additional messages for debugging |
callFrom | (character) allow easier tracking of message(s) produced |
This function returns only a plot with ROC curves
See also: summarizeForROC, moderTest2grp
roc0 <- cbind(alph=c(2e-6,4e-5,4e-4,2.7e-3,1.6e-2,4.2e-2,8.3e-2,1.7e-1,2.7e-1,4.1e-1,5.3e-1,
    6.8e-1,8.3e-1,9.7e-1),
  spec=c(1,1,1,1,0.957,0.915,0.915,0.809,0.702,0.489,0.362,0.234, 0.128,0.0426),
  sens=c(0,0,0.145,0.942,2.54,2.68,3.33,3.99,4.71,5.87,6.67,8.04,8.77, 9.93)/10,
  n.pos.a=c(0,0,0,0,2,4,4,9,14,24,36,41) )
plotROC(roc0)
razorNoFilter filters based on either a) number of total peptides and specific peptides or b) number of razor peptides.
This function was designed for filtering using a minimum number of (PSM-) count values, following the common practice to consider results with 2 or more peptide counts as reliable.
The function can be (re-)run independently on each of various questions (comparisons).
Note: Non-integer data will be truncated to integer (equivalent to floor).
razorNoFilter(
  annot, speNa = NULL, totNa = NULL, minRazNa = NULL,
  minSpeNo = 1, minTotNo = 2, silent = FALSE, debug = FALSE, callFrom = NULL
)
annot | (matrix or data.frame) main data (may contain NAs) with (PSM-) count values for each protein |
speNa | (integer or character) indicate which column of 'annot' has number of specific peptides |
totNa | (integer or character) indicate which column of 'annot' has number of total peptides |
minRazNa | (integer or character) name of column with number of razor peptides, alternative to 'minSpeNo'& 'minTotNo' |
minSpeNo | (integer) minimum number of specific peptides |
minTotNo | (integer) minimum total ie max razor number of peptides |
silent | (logical) suppress messages |
debug | (logical) display additional messages for debugging |
callFrom | (character) allow easier tracking of messages produced |
This function returns a vector of logical values indicating whether the corresponding line passes the filter criteria
set.seed(2019); datT <- matrix(sample.int(20,60,replace=TRUE), ncol=6,
  dimnames=list(letters[1:10], LETTERS[1:6])) -3
datT[,2] <- datT[,2] +2
datT[which(datT <0)] <- 0
razorNoFilter(datT, speNa="A", totNa="B")
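Variant a) of the filter (specific and total peptide counts) can be sketched in base R. The helper countFilter is hypothetical and not the code of razorNoFilter; in particular, treating lines with NA counts as not passing is an assumption of this sketch.

```r
## Hypothetical helper: logical filter on specific and total peptide counts
countFilter <- function(annot, speNa, totNa, minSpeNo=1, minTotNo=2) {
  spe <- floor(annot[, speNa])      # non-integer counts get truncated
  tot <- floor(annot[, totNa])
  ok <- spe >= minSpeNo & tot >= minTotNo
  ok[is.na(ok)] <- FALSE            # assumption: undetermined lines do not pass
  ok
}

cnt <- cbind(A=c(0, 1, 2.7, NA, 3), B=c(3, 1, 4, 5, 2))
res <- countFilter(cnt, speNa="A", totNa="B")
res
```

The resulting logical vector can be used directly for subsetting lines of the quantitation data, separately for each comparison.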
Protein quantification results from AlphaPept can be read using this function. Input files compressed as .gz can be read as well. The protein abundance values (XIC) get extracted. Since protein annotation is not very extensive with this format of data, the function allows reading the initial fasta files (from the directory above the quantitation-results) to extract more protein-annotation (like species). Sample-annotation (if available) can be extracted from sdrf files, too. The protein abundance values may be normalized using multiple methods (selected via argument normalizeMeth, which is set to "none" in the usage shown here), and the determination of normalization factors can be restricted to specific proteins (normalization to bait protein(s), or to the invariable matrix of spike-in experiments). The protein annotation data gets parsed to extract specific fields (ID, name, description, species ...). Besides, a graphical display of the distribution of protein abundance values may be generated before and after normalization.
readAlphaPeptFile(
  fileName = "results_proteins.csv", path = NULL, fasta = NULL,
  isLog2 = FALSE, normalizeMeth = "none", quantCol = "_LFQ$",
  contamCol = NULL, read0asNA = TRUE, refLi = NULL, sampleNames = NULL,
  specPref = NULL, extrColNames = NULL, remRev = TRUE, remConta = FALSE,
  separateAnnot = TRUE, gr = NULL, sdrf = NULL, suplAnnotFile = NULL,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  titGraph = NULL, wex = 1.6, plotGraph = TRUE, silent = FALSE,
  debug = FALSE, callFrom = NULL
)
fileName | (character) name of file to be read (default 'results_proteins.csv'). Gz-compressed files can be read, too. |
path | (character) path of file to be read |
fasta | (logical or character) if |
isLog2 | (logical) typically data read from AlphaPept are expected NOT to be |
normalizeMeth | (character) normalization method, defaults to |
quantCol | (character or integer) exact col-names, or if length=1 content of |
contamCol | (character or integer, length=1) which columns should be used for contaminants |
read0asNA | (logical) decide if initial quantifications at 0 should be transformed to NA (thus avoid -Inf in log2 results) |
refLi | (character or integer) custom specify which line of data should be used for normalization, ie which line is main species; if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
sampleNames | (character) custom column-names for quantification data; this argument has priority over |
specPref | (character) prefix to identifiers allowing to separate i) recognize contamination database, ii) species of main identifications and iii) spike-in species |
extrColNames | (character or |
remRev | (logical) option to remove all protein-identifications based on reverse-peptides |
remConta | (logical) option to remove all proteins identified as contaminants |
separateAnnot | (logical) if |
gr | (character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf | (character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeExchange, the second & third elements may give further indications for automatic organization of groups of replicates. Besides, the output from |
suplAnnotFile | (logical or character) optional reading of supplemental files produced by Compomics; if |
groupPref | (list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
titGraph | (character) custom title to plot of distribution of quantitation values |
wex | (numeric) relative expansion factor of the violin in plot |
plotGraph | (logical) optional plot vioplot of initial and normalized data (using |
silent | (logical) suppress messages |
debug | (logical) additional messages for debugging |
callFrom | (character) allow easier tracking of messages produced |
Meta-data describing the samples and experimental setup may be available from a sdrf-file (from the directory above the analysis/quantification results).
If available, the meta-data will be examined for determining groups of replicates and
the results thereof can be found in $sampleSetup$levels.
Alternatively, a data.frame formatted like sdrf-files (ie for each sample a separate line, see also function readSdrf) may be given, too.
This import-function has been developed using AlphaPept version x.x.
The final output is a list containing these elements: $raw, $quant, $annot, $counts, $sampleSetup, $quantNotes, $notes, or (if separateAnnot=FALSE) a data.frame with annotation- and main quantification-content. If sdrf information has been found, an additional list-element setup will be added containing the entire meta-data as setup$meta and the suggested organization as setup$lev.
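As a rough illustration of how sdrf-like meta-data (one line per sample) can yield groups of replicates, here is a minimal base-R sketch. The column names and values are hypothetical and the grouping shown is only an assumption for illustration; it is not the procedure used internally by the import functions.

```r
## Hypothetical sdrf-like table: one line per sample (column names are illustrative only)
sdrfLike <- data.frame(
  source.name = paste0("sample", 1:6),
  comment.data.file. = paste0("run", 1:6, ".raw"),
  characteristics.condition. = rep(c("ctrl", "treated"), each = 3),
  stringsAsFactors = FALSE
)

## Derive a grouping of replicates from one meta-data column
grp <- factor(sdrfLike$characteristics.condition.)
names(grp) <- sdrfLike$source.name

## Number of replicates per group
table(grp)
```

A factor of this kind is the sort of object that could be supplied via the argument gr when the automatic grouping is not the desired one.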
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot (columns ), $counts an array with 'PSM' and 'NoOfRazorPeptides', $quantNotes, $notes and optional setup for meta-data from sdrf; or a data.frame with quantitation and annotation if separateAnnot=FALSE
read.table, normalizeThis, readProteomeDiscovererFile, readProlineFile (and other import-functions), matrixNAinspect
path1 <- system.file("extdata", package="wrProteo")
# Here we'll load a short/trimmed example file
fiNaAP <- "tinyAlpaPeptide.csv.gz"
dataAP <- readAlphaPeptFile(file=fiNaAP, path=path1, tit="tiny AlphaPaptide ")
summary(dataAP$quant)
This function allows importing protein identification and quantification results from DIA-NN.
Data should be exported as tabulated text (tsv) as protein-groups (pg) to allow import by this function.
Quantification data and other relevant information will be parsed and extracted (similar to the other import-functions from this package).
The final output is a list containing as (main) elements: $annot, $raw and $quant, or a data.frame with the quantification data and a part of the annotation if argument separateAnnot=FALSE.
readDiaNNFile(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  read0asNA = TRUE,
  quantCol = "\\.raw$",
  annotCol = NULL,
  refLi = NULL,
  separateAnnot = TRUE,
  FDRCol = NULL,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  plotGraph = TRUE,
  titGraph = "DiaNN",
  wex = 1.6,
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = FALSE,
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA (thus avoiding -Inf in log2 results) |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
- not used (the argument was kept to retain the same syntax as the other import functions of this package) |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second element may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files; however, if |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been developed using DIA-NN version 1.8.x. Note that reading gene-group (gg) files is in principle possible, but the resulting files typically lack protein-identifiers, which may be less convenient in later steps of analysis. Thus, it is suggested to read protein-group (pg) files instead.
Using the argument suplAnnotFile it is possible to specify a specific file (or search for default file) to read for extracting file-names as sample-names and other experiment related information.
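The default quantCol = "\\.raw$" is a regular expression applied to column names. The following base-R sketch shows how such a pattern can pick out quantitation columns and how sample names could be derived from them. The column names are hypothetical and the clean-up steps are an assumption for illustration, not the package's internal code.

```r
## Hypothetical column names of a protein-group (pg) table
colNa <- c("Protein.Group", "Protein.Ids", "Genes",
           "D:\\data\\runA_1.raw", "D:\\data\\runA_2.raw", "D:\\data\\runB_1.raw")

## Columns whose names end in '.raw' are taken as quantitation columns
isQuant <- grepl("\\.raw$", colNa)

## Derive short sample names: unify path separators, drop the directory and the extension
sampleNa <- sub("\\.raw$", "", basename(gsub("\\\\", "/", colNa[isQuant])))
sampleNa
```

If the derived names are not suitable, custom names can be given via the argument sampleNames.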
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts an array with number of peptides, $quantNotes and $notes; or if separateAnnot=FALSE the function returns a data.frame with annotation and quantitation only
read.table, normalizeThis, readMaxQuantFile, readProtDiscovFile, readProlineFile
diaNNFi1 <- "tinyDiaNN1.tsv.gz"
## This file contains much less identifications than one may usually obtain
path1 <- system.file("extdata", package="wrProteo")
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="HUMAN")
dataNN <- readDiaNNFile(path1, file=diaNNFi1, specPref=specPref1, tit="Tiny DIA-NN Data")
summary(dataNN$quant)
This function allows importing peptide identification and quantification results from DiaNN.
Data should be exported as tabulated text (tsv) to allow import by this function.
Quantification data and other relevant information will be extracted similarly to the other import-functions from this package.
The final output is a list containing as (main) elements: $annot, $raw and $quant, or a data.frame with the quantification data and a part of the annotation if argument separateAnnot=FALSE.
readDiaNNPeptides(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  read0asNA = TRUE,
  quantCol = "\\.raw$",
  annotCol = NULL,
  refLi = NULL,
  separateAnnot = TRUE,
  FDRCol = NULL,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  plotGraph = TRUE,
  titGraph = "DiaNN",
  wex = 1.6,
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = FALSE,
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA (thus avoiding -Inf in log2 results) |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) - not used |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second element may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files; however, if |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been developed using DiaNN version 1.8.x.
Using the argument suplAnnotFile it is possible to specify a specific file (or search for default file) to read for extracting file-names as sample-names and other experiment related information.
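The argument read0asNA exists because abundances of 0 turn into -Inf on log2 scale. The effect can be shown with a few lines of base R. The values below are invented for illustration and this sketch only mirrors the idea, not the function's internal code.

```r
## Invented raw abundances for two peptides in three samples
raw <- matrix(c(1200, 0, 850, 0, 4100, 3900), nrow = 2,
              dimnames = list(c("pepA", "pepB"), c("s1", "s2", "s3")))

## Without replacement, zeros give -Inf after log2 transformation
log2(raw)

## Treating 0 as 'not quantified' gives NA instead, which NA-aware tools can handle
rawNA <- raw
rawNA[rawNA == 0] <- NA
quant <- log2(rawNA)
quant
```

NA values produced this way can later be inspected and imputed, whereas -Inf would distort means and medians.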
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts an array with number of peptides, $quantNotes and $notes; or if separateAnnot=FALSE the function returns a data.frame with annotation and quantitation only
read.table, normalizeThis, readMaxQuantFile, readProtDiscovFile, readProlineFile
diaNNFi1 <- "tinyDiaNN1.tsv.gz"
## This file contains much less identifications than one may usually obtain
path1 <- system.file("extdata", package="wrProteo")
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="HUMAN")
dataNN <- readDiaNNFile(path1, file=diaNNFi1, specPref=specPref1, tit="Tiny DIA-NN Data")
summary(dataNN$quant)
Read a fasta-formatted file (from UniProt) to extract (protein) sequences and names.
If tableOut=TRUE the output will be organized as a matrix, separating meta-annotation (eg uniqueIdentifier, entryName, proteinName, GN) into separate columns.
readFasta2(
  filename,
  delim = "|",
  databaseSign = c("sp", "tr", "generic", "gi"),
  removeEntries = NULL,
  tableOut = FALSE,
  UniprSep = c("OS=", "OX=", "GN=", "PE=", "SV="),
  strictSpecPattern = TRUE,
  cleanCols = TRUE,
  silent = FALSE,
  callFrom = NULL,
  debug = FALSE
)
filename |
(character) name of fasta-file to be read |
delim |
(character) delimiter at header-line |
databaseSign |
(character) characters at beginning right after the '>' (typically specifying the data-base-origin), they will be excluded from the sequence-header |
removeEntries |
(character) if |
tableOut |
(logical) toggle to return named character-vector or matrix with enhanced parsing of fasta-header.
The resulting matrix will contain the columns 'database','uniqueIdentifier','entryName','proteinName','sequence' and further columns depending on argument |
UniprSep |
(character) separators for further separating entry-fields if |
strictSpecPattern |
(logical or character) pattern for recognizing EntryName which is typically preceding ProteinName (separated by ' '); if |
cleanCols |
(logical) remove columns with all entries NA, if |
silent |
(logical) suppress messages |
callFrom |
(character) allows easier tracking of messages produced |
debug |
(logical) supplemental messages for debugging |
This function returns (depending on argument tableOut) a) a simple character vector (of sequences) with (entire) Uniprot annotation as name or
b) a matrix with columns: 'database','uniqueIdentifier','entryName','proteinName','sequence' and further columns depending on argument UniprSep
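To make the column names of the matrix output more concrete, the following base-R sketch splits a single UniProt-style header into the same kind of fields. The header line is given for illustration only, and the helper getField is hypothetical; this is a simplified stand-in and not the parsing code of readFasta2.

```r
## One UniProt-style header line (illustrative)
hdr <- ">sp|P02769|ALBU_BOVIN Albumin OS=Bos taurus OX=9913 GN=ALB PE=1 SV=4"

## Split at the delimiter '|' after removing the leading '>'
parts <- strsplit(sub("^>", "", hdr), "|", fixed = TRUE)[[1]]
database <- parts[1]
uniqueIdentifier <- parts[2]

## The third part holds entryName, proteinName and the tagged fields (OS=, OX=, GN=, ...)
entryName <- sub(" .*", "", parts[3])
descr <- sub("^[^ ]+ ", "", parts[3])
proteinName <- sub(" OS=.*", "", descr)

## Hypothetical helper: extract the text following a tag up to the next two-letter tag
getField <- function(tag, x) {
  if (!grepl(tag, x, fixed = TRUE)) return(NA_character_)
  sub(paste0(".*", tag, "(.*?)( [A-Z]{2}=.*|$)"), "\\1", x, perl = TRUE)
}

c(database = database, uniqueIdentifier = uniqueIdentifier, entryName = entryName,
  proteinName = proteinName, OS = getField("OS=", descr), GN = getField("GN=", descr))
```

Headers from other databases may not follow this layout, which is presumably why the separators are exposed as arguments (delim, databaseSign, UniprSep).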
writeFasta2 for writing as fasta; for reading scan or read.fasta from the package seqinr
## Tiny example with common contaminants
path1 <- system.file('extdata', package='wrProteo')
fiNa <- "conta1.fasta.gz"
fasta1 <- readFasta2(file.path(path1, fiNa))
## now let's read and further separate annotation-fields
fasta2 <- readFasta2(file.path(path1, fiNa), tableOut=TRUE)
str(fasta1)
This function allows importing protein identification and quantification results from Fragpipe
which were previously exported as tabulated text (tsv). Quantification data and other relevant information will be extracted similarly to the other import-functions from this package.
The final output is a list containing the elements: $annot, $raw and $quant, or a data.frame with the quantification data and a part of the annotation if argument separateAnnot=FALSE.
readFragpipeFile(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  read0asNA = TRUE,
  quantCol = "Intensity$",
  annotCol = NULL,
  refLi = NULL,
  separateAnnot = TRUE,
  FDRCol = list("Protein.Probability", lim = 0.99),
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  plotGraph = TRUE,
  titGraph = "FragPipe",
  wex = 1.6,
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = FALSE,
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA (thus avoiding -Inf in log2 results) |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) optional indication to search for protein FDR information |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files; however, if |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been developed using Fragpipe versions 18.0 and 19.0.
Using the argument suplAnnotFile it is possible to specify a specific file (or search for default file) to read for extracting file-names as sample-names and other experiment related information.
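The argument specPref describes text patterns used to tag proteins as 'conta', 'mainSpe', 'species2', etc. A simplified base-R sketch of such tagging by pattern matching is shown below. The protein descriptions are made up for illustration, and letting the contaminant pattern take priority is an assumption of this sketch rather than documented behaviour.

```r
## Made-up protein descriptions (illustrative only)
protDescr <- c("CON_P00761 Trypsin",
               "P02768 Albumin OS=Homo sapiens",
               "P00330 Alcohol dehydrogenase 1 OS=Saccharomyces cerevisiae",
               "P00698 LYSC_CHICK Lysozyme C OS=Gallus gallus")

## Patterns in the style of argument specPref, extended by one supplemental species
specPref <- c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens",
              species2 = "OS=Saccharomyces cerevisiae")

## Tags to assign for each pattern
tags <- c(conta = "conta", mainSpecies = "mainSpe", species2 = "species2")

## Assign in reverse order so that earlier patterns (here: contaminants) win
specType <- rep(NA_character_, length(protDescr))
for (i in rev(seq_along(specPref))) {
  specType[grepl(specPref[i], protDescr)] <- tags[[names(specPref)[i]]]
}
specType
```

A vector like specType corresponds to the column 'SpecType' of $annot, which in turn can be referenced via refLi (eg refLi='mainSpe').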
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts an array with number of peptides, $quantNotes and $notes; or if separateAnnot=FALSE the function returns a data.frame with annotation and quantitation only
read.table, normalizeThis, readMaxQuantFile, readProtDiscovFile, readProlineFile
FPproFi1 <- "tinyFragpipe1.tsv.gz"
path1 <- system.file("extdata", package="wrProteo")
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="MOUSE")
dataFP <- readFragpipeFile(path1, file=FPproFi1, specPref=specPref1, tit="Tiny Fragpipe Data")
summary(dataFP$quant)
This function allows importing initial peptide identification and quantification results from Ionbot
which were exported as tabulated text (tsv); relevant information will be extracted.
The final output is a list containing 3 main elements: $annot, $raw and optional $quant, or a data.frame with the entire content of the file if separateAnnot=FALSE.
readIonbotPeptides(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  gr = NULL,
  sdrf = NULL,
  read0asNA = TRUE,
  quantCol = "^Abundances*",
  annotCol = NULL,
  contamCol = "Contaminant",
  refLi = NULL,
  separateAnnot = TRUE,
  FDRCol = list(c("^Protein.FDR.Confidence", "High"), c("^Found.in.Sample.", "High")),
  plotGraph = TRUE,
  suplAnnotFile = TRUE,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  titGraph = "Ionbot",
  wex = 1.6,
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) new column-names for quantification data (ProteomeDiscoverer does not automatically use file-names from spectra); this argument has priority over |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
contamCol |
(character or integer, length=1) which columns should be used for contaminants marked by ProteomeDiscoverer.
If a column named |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) optional indication to search for protein FDR information |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by ProteomeDiscoverer; however, if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
titGraph |
(character) deprecated custom title to plot, please use 'tit' |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
Using the argument suplAnnotFile it is possible to specify a specific file (or search for default file) to read for extracting file-names as sample-names
and other experiment related information.
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts an array with number of peptides, $quantNotes and $notes; or if separateAnnot=FALSE the function returns a data.frame with annotation and quantitation only
read.table, readMaxQuantFile, readProteomeDiscovererFile, normalizeThis
path1 <- system.file("extdata", package="wrProteo")
fiIonbot <- "tinyIonbotFile1.tsv.gz"
datIobPep <- readIonbotPeptides(fiIonbot, path=path1)
Quantification results using MassChroQ should be initially treated using the R-package MassChroqR (both distributed by the PAPPSO at http://pappso.inrae.fr/) for initial normalization on peptide-level and combination of peptide values into protein abundances.
readMassChroQFile(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  refLi = NULL,
  separateAnnot = TRUE,
  titGraph = "MassChroQ",
  wex = NULL,
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = FALSE,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  plotGraph = TRUE,
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read (may be tsv, csv, rda or rdata); both US and European csv formats are supported |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method (will be sent to |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by ProteomeDiscoverer; however, if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
plotGraph |
(logical) optional plot of type vioplot of initial and normalized data (using |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
The final output of this function is a list containing the elements $annot, $raw, $quant and $notes, or a data.frame with the entire content of the file if separateAnnot=FALSE. Other list-elements remain empty to keep the format compatible with the other import functions.
This function has been developed using MassChroQ version 2.2 and R-package MassChroqR version 0.4.0. Both are distributed by the PAPPSO (http://pappso.inrae.fr/). When saving quantifications generated in R as RData (with extension .rdata or .rda) using the R-packages associated with MassChroq, the ABUNDANCE_TABLE produced by mcq.get.compar(XICAB) should be used.
After import, the data get (re-)normalized according to normalizeMeth and refLi, and boxplots or vioplots are drawn.
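As a hedged illustration of what a median normalization on log2-scale data does, consider the base-R sketch below. The values are invented, and aligning all column medians to their common mean is an assumption made for this example; the actual normalization is delegated to normalizeThis and may differ in detail.

```r
## Invented log2 abundances: 4 proteins in 3 samples, with one NA
quant <- cbind(s1 = c(20, 22, 24, NA),
               s2 = c(21, 23, 25, 27),
               s3 = c(19.5, 21.5, 23.5, 25.5))

## Median of each sample (column), ignoring NA
colMed <- apply(quant, 2, median, na.rm = TRUE)

## Shift each column so that all medians coincide (here: at the mean of the medians)
normQuant <- sweep(quant, 2, colMed - mean(colMed))

## After normalization all column medians are equal
apply(normQuant, 2, median, na.rm = TRUE)
```

On log2 scale a shift like this corresponds to multiplying the raw abundances of a sample by a constant factor.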
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts an array with number of peptides, $quantNotes and $notes; or if separateAnnot=FALSE the function returns a data.frame with annotation and quantitation only
read.table, normalizeThis, readProlineFile
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "tinyMC.RData"
dataMC <- readMassChroQFile(file=fiNa, path=path1)
Protein quantification results from MaxQuant can be read using this function and relevant information extracted. Input files compressed as .gz can be read as well. The protein abundance values (XIC), peptide counting information like number of unique razor-peptides or PSM values and sample-annotation (if available) can be extracted, too. The protein abundance values may be normalized using multiple methods (median normalization as default), the determination of normalization factors can be restricted to specific proteins (normalization to bait protein(s), or to invariable matrix of spike-in experiments). The protein annotation data gets parsed to extract specific fields (ID, name, description, species ...). Besides, a graphical display of the distribution of protein abundance values may be generated before and after normalization.
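Restricting the determination of normalization factors to specific proteins (eg the invariable matrix of a spike-in experiment) can be sketched in base R as follows. All values and protein names are invented, and the code is only an assumption about the general principle, not the implementation used by this function.

```r
## Invented log2 abundances: three 'matrix' proteins and one spike-in protein
quant <- cbind(s1 = c(20, 22, 24, 15),
               s2 = c(21, 23, 25, 18),
               s3 = c(19, 21, 23, 20))
rownames(quant) <- c("hum1", "hum2", "hum3", "spike1")
specType <- c("mainSpe", "mainSpe", "mainSpe", "species2")

## Only lines of the main species are used to determine the normalization shifts
refLi <- which(specType == "mainSpe")
refMed <- apply(quant[refLi, , drop = FALSE], 2, median, na.rm = TRUE)

## The shifts are then applied to all lines, including the spike-in protein
normQuant <- sweep(quant, 2, refMed - mean(refMed))
normQuant
```

In this toy example the matrix proteins become identical across samples, while the increasing trend of the spike-in protein is preserved.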
readMaxQuantFile(
  path,
  fileName = "proteinGroups.txt",
  normalizeMeth = "median",
  quantCol = "LFQ.intensity",
  contamCol = "Potential.contaminant",
  pepCountCol = c("Razor + unique peptides", "Unique peptides", "MS.MS.count"),
  read0asNA = TRUE,
  refLi = NULL,
  sampleNames = NULL,
  extrColNames = c("Majority.protein.IDs", "Fasta.headers", "Number.of.proteins"),
  specPref = c(conta = "conta|CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  remRev = TRUE,
  remConta = FALSE,
  separateAnnot = TRUE,
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = NULL,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  titGraph = NULL,
  wex = 1.6,
  plotGraph = TRUE,
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
path |
(character) path of file to be read |
fileName |
(character) name of file to be read (default 'proteinGroups.txt' as typically generated by MaxQuant in txt folder). Gz-compressed files can be read, too. |
normalizeMeth |
(character) normalization method, defaults to |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
contamCol |
(character or integer, length=1) which columns should be used for contaminants |
pepCountCol |
(character) pattern to search among column-names for count data (1st entry for 'Razor + unique peptides', 2nd for 'Unique peptides', 3rd for 'MS.MS.count' (PSM)) |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA (thus avoiding -Inf in log2 results) |
refLi |
(character or integer) custom specification of which line(s) of data should be used for normalization, ie which lines represent the main species; if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for an exact match of the (single) term given |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
extrColNames |
(character) column names to be read (1st position: prefix for LFQ quantitation, default 'LFQ.intensity'; 2nd: column name for protein-IDs, default 'Majority.protein.IDs'; 3rd: column names of fasta-headers, default 'Fasta.headers', 4th: column name for number of protein IDs matching, default 'Number.of.proteins') |
specPref |
(character) prefixes to identifiers, used to recognize i) entries from the contamination database, ii) the species of the main identifications and iii) spike-in species |
remRev |
(logical) option to remove all protein-identifications based on reverse-peptides |
remConta |
(logical) option to remove all proteins identified as contaminants |
separateAnnot |
(logical) if |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by MaxQuant; if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to May contain |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(numeric) relative expansion factor of the violin in plot |
plotGraph |
(logical) optional plot vioplot of initial and normalized data (using |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
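The effect described for the argument `read0asNA` can be sketched in base R. The toy matrix below is made up for illustration; this only shows why zeros are worth converting before taking log2 and is not the code used inside the function.

```r
# Toy raw intensities with zeros standing for 'not quantified'
rawQuant <- matrix(c(1024,   0, 2048,
                      512, 256,    0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("protA", "protB"), c("s1", "s2", "s3")))

log2(rawQuant)              # zeros turn into -Inf, which disturbs medians and plots

quantNA <- rawQuant
quantNA[quantNA == 0] <- NA # treat 0 as missing value
log2Quant <- log2(quantNA)  # now NA instead of -Inf
log2Quant
```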
MaxQuant is proteomics quantification software provided by the Max Planck Institute.
By default MaxQuant writes the results of each run to the path combined/txt
, from there (only) the files
'proteinGroups.txt' (main quantitation at protein level), 'summary.txt' and 'parameters.txt' will be used.
Meta-data describing the samples and experimental setup may be available from two sources :
a) The file summary.txt
which gets produced by MaxQuant in the same folder as the main quantification data.
b) Furthermore, meta-data deposited as sdrf
at Pride can be retrieved (via the respective github page) when giving the accession number in argument sdrf
.
Then, the meta-data will be examined for determining groups of replicates and
the results thereof can be found in $sampleSetup$levels.
Alternatively, a dataframe formatted like sdrf-files (ie for each sample a separate line, see also function readSdrf
) may be given.
In tricky cases it is also possible to specify the column-name to use for defining the groups of replicates or the method for automatically choosing
the most suited column via the 2nd value of the argument sdrf
.
Please note that sdrf is still experimental and only a small fraction of the proteomics-data on Pride has been annotated accordingly.
If a valid sdrf is furnished, its information has priority over the information extracted from the MaxQuant produced file summary.txt.
This import-function has been developed using MaxQuant versions 1.6.10.x to 2.0.x; the format of the resulting file 'proteinGroups.txt' is typically well conserved between versions.
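As a rough sketch of how groups of replicates may be derived from an sdrf-like table (one line per sample), consider the base-R example below. The table content and its column names are purely illustrative assumptions; the function itself examines the meta-data in its own way.

```r
# Hypothetical sdrf-like table: one line per sample (column names are made up)
sdrfLike <- data.frame(
  source.name = paste0("sample", 1:6),
  data.file = paste0("run", 1:6, ".raw"),
  spike.conc = rep(c("low", "high"), each = 3),
  stringsAsFactors = FALSE)

# Choose the column describing the experimental condition and turn it into a
# grouping factor, keeping the order of first appearance
grpCol <- "spike.conc"
gr <- factor(sdrfLike[[grpCol]], levels = unique(sdrfLike[[grpCol]]))
table(gr)
```

A factor like `gr` is the kind of object that may be passed to the argument `gr` to define the replicate structure manually.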
The final output is a list containing these elements: $raw
, $quant
, $annot
, $counts
, $sampleSetup
, $quantNotes
, $notes
, or (if separateAnnot=FALSE
) data.frame
with annotation- and main quantification-content. If sdrf
information has been found, an additional list-element setup
will be added containing the entire meta-data as setup$meta
and the suggested organization as setup$lev
.
This function returns a list with $raw
(initial/raw abundance values), $quant
with final normalized quantitations, $annot
(columns ), $counts
an array with 'PSM' and 'NoOfRazorPeptides',
$quantNotes
, $notes
and optional setup
for meta-data from sdrf
; or a data.frame with quantitation and annotation if separateAnnot=FALSE
read.table, normalizeThis, readProteomeDiscovererFile, readProlineFile (and other import functions), matrixNAinspect
path1 <- system.file("extdata", package="wrProteo")
# Here we'll load a short/trimmed example file (thus not the MaxQuant default name)
fiNa <- "proteinGroupsMaxQuant1.txt.gz"
specPr <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spike="HUMAN_UPS")
dataMQ <- readMaxQuantFile(path1, file=fiNa, specPref=specPr, tit="tiny MaxQuant")
summary(dataMQ$quant)
matrixNAinspect(dataMQ$quant, gr=gl(3,3))
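The example above uses `gr=gl(3,3)`, ie three groups of three replicates each. Without relying on the package, the number of NA values per protein and group can be tabulated as in the following base-R sketch; the matrix is toy data and the approach is only an assumption about what one may want to inspect, not a description of `matrixNAinspect`.

```r
# Toy log2 quantitation: 3 proteins x 9 samples, 3 groups of 3 replicates
quant <- rbind(
  p1 = c(10.1, 10.3,   NA, 12.0, 12.2, 11.9, 14.1,   NA,   NA),
  p2 = c(  NA,   NA,   NA,  9.5,  9.7,  9.6,  9.4,  9.8,  9.5),
  p3 = c(20.0, 20.1, 19.9, 20.2,   NA, 20.0, 20.1, 20.3, 20.2))
colnames(quant) <- paste0("s", 1:9)
gr <- gl(3, 3)

# Count NAs per protein (rows) and per group of replicates (columns)
naPerGroup <- sapply(levels(gr), function(g)
  rowSums(is.na(quant[, gr == g, drop = FALSE])))
naPerGroup
```

Proteins like `p2`, missing in all replicates of one group, suggest absence or low abundance rather than a random drop-out.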
Peptide level identification and quantification data produced by MaxQuant can be read using this function and relevant information extracted. Input files compressed as .gz can be read as well. The peptide abundance values (XIC), peptide counting information and sample-annotation (if available) can be extracted, too.
readMaxQuantPeptides(
  path,
  fileName = "peptides.txt",
  normalizeMeth = "median",
  quantCol = "Intensity",
  contamCol = "Potential.contaminant",
  pepCountCol = "Experiment",
  refLi = NULL,
  sampleNames = NULL,
  extrColNames = c("Sequence", "Proteins", "Leading.razor.protein", "Start.position",
    "End.position", "Mass", "Missed.cleavages", "Unique..Groups.", "Unique..Proteins.",
    "Charges"),
  specPref = c(conta = "conta|CON_|LYSC_CHICK", mainSpecies = "HUMAN"),
  remRev = TRUE,
  remConta = FALSE,
  separateAnnot = TRUE,
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = NULL,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  titGraph = NULL,
  wex = 1.6,
  plotGraph = TRUE,
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
path |
(character) path of file to be read |
fileName |
(character) name of file to be read (default 'peptides.txt' as typically generated by MaxQuant in txt folder). Gz-compressed files can be read, too. |
normalizeMeth |
(character) normalization method (for details see |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
contamCol |
(character or integer, length=1) which columns should be used for contaminants |
pepCountCol |
(character) pattern to search among column-names for count data (defaults to 'Experiment') |
refLi |
(character or integer) custom specify which line of data should be used for normalization, ie which line is main species; if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
extrColNames |
(character) column names to be read (1st position: prefix for quantitation, default 'intensity'; 2nd: column name for peptide-IDs, default ) |
specPref |
(character) prefix to identifiers allowing to separate i) recognize contamination database, ii) species of main identifications and iii) spike-in species |
remRev |
(logical) option to remove all peptide-identifications based on reverse-peptides |
remConta |
(logical) option to remove all peptides identified as contaminants |
separateAnnot |
(logical) if |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of
replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by MaxQuant; if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
titGraph |
(character) custom title to plot |
wex |
(numeric) relative expansion factor of the violin in plot |
plotGraph |
(logical) optional plot vioplot of initial and normalized data (using |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
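The way `specPref` assigns entries to contaminants, main species and spike-in species can be pictured with a small base-R sketch based on pattern matching. The identifiers below are toy strings, and both the labels and the order of precedence (contaminants win) are assumptions made for this illustration; the package may use different labels and matching rules.

```r
# Toy identifiers as they may appear in annotation columns
ids <- c("CON__P00761", "sp|P02768|ALBU_HUMAN", "sp|P00330|ADH1_YEAST",
         "sp|P61626|LYSC_HUMAN", "sp|P00698|LYSC_CHICK")

specPref <- c(conta = "conta|CON_|LYSC_CHICK", mainSpecies = "YEAST", spike = "HUMAN")

# Apply the patterns from last to first, so that earlier entries
# (here: contaminants) overwrite later ones when several patterns match
specType <- rep(NA_character_, length(ids))
for (i in rev(seq_along(specPref))) {
  specType[grepl(specPref[i], ids)] <- names(specPref)[i]
}
data.frame(ids, specType)
```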
The peptide annotation data gets parsed to extract specific fields (ID, name, description, species ...). Besides, a graphical display of the distribution of peptide abundance values may be generated before and after normalization.
MaxQuant is proteomics quantification software provided by the Max Planck Institute.
By default MaxQuant writes the results of each run to the path combined/txt
, from there (only) the files
'peptides.txt' (main quantitation at peptide level), 'summary.txt' and 'parameters.txt' will be used for this function.
Meta-data describing the samples and experimental setup may be available from two sources :
a) The file summary.txt
which gets produced by MaxQuant in the same folder as the main quantification data.
b) Furthermore, meta-data deposited as sdrf
at Pride can be retrieved (via the respective github page) when giving
the accession number in argument sdrf
.
Then, the meta-data will be examined for determining groups of replicates and
the results thereof can be found in $sampleSetup$levels.
Alternatively, a dataframe formatted like sdrf-files (ie for each sample a separate line, see also function readSdrf
) may be given.
In tricky cases it is also possible to specify the column-name to use for defining the groups of replicates or the method for automatically choosing
the most suited column via the 2nd value of the argument sdrf
, see also the function defineSamples
(which gets used internally).
Please note that sdrf is still experimental and only a small fraction of the proteomics-data on Pride has been annotated accordingly.
If a valid sdrf is furnished, its information has priority over the information extracted from the MaxQuant produced file summary.txt.
This function has been developed using MaxQuant versions 1.6.10.x to 2.0.x; the format of the resulting file 'peptides.txt'
is typically well conserved between versions.
The final output is a list containing these elements: $raw
, $quant
, $annot
, $counts
, $sampleSetup
,
$quantNotes
, $notes
, or (if separateAnnot=FALSE
) data.frame
with annotation- and main quantification-content. If sdrf
information has been found, an additional list-element setup
will be added containing the entire meta-data as setup$meta
and the suggested organization as setup$lev
.
This function returns a list with $raw
(initial/raw abundance values), $quant
with final normalized quantitations, $annot
(columns ), $counts
an array with 'PSM' and 'NoOfRazorPeptides',
$quantNotes
, $notes
and optional setup
for meta-data from sdrf
; or a data.frame with quantitation and annotation if separateAnnot=FALSE
read.table, normalizeThis; for reading protein level data: readMaxQuantFile, readProlineFile
# Here we'll load a short/trimmed example file (thus not the MaxQuant default name)
MQpepFi1 <- "peptides_tinyMQ.txt.gz"
path1 <- system.file("extdata", package="wrProteo")
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spec2="HUMAN")
dataMQpep <- readMaxQuantPeptides(path1, file=MQpepFi1, specPref=specPref1,
  tit="Tiny MaxQuant Peptides")
summary(dataMQpep$quant)
Protein quantification results from OpenMS
which were exported as .csv
can be imported and relevant information extracted.
Peptide data get summarized by protein by top3 or sum methods.
The final output is a list containing the elements: $annot
, $raw
, $quant
ie normalized final quantifications, or returns data.frame with entire content of file if separateAnnot=FALSE
.
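The summarization of peptide data by protein can be sketched in base R as follows. Here 'top3' is taken as the mean of the (up to) three most intense peptides per protein and sample, which is a common definition but an assumption with respect to this function; the data and the helper names `top3Fun` and `sumFun` are hypothetical.

```r
# Toy peptide-level intensities for two proteins and two samples
pep <- data.frame(
  ProteinName = c("protA", "protA", "protA", "protA", "protB", "protB"),
  s1 = c(100, 400, 200, 50, 30, 10),
  s2 = c(110, 380, 220, NA, 25, 12))

# Hypothetical helpers: mean of the top-3 intensities, and plain sum
top3Fun <- function(v) {
  v <- sort(v[!is.na(v)], decreasing = TRUE)
  if (length(v) < 1) NA_real_ else mean(head(v, 3))
}
sumFun <- function(v) sum(v, na.rm = TRUE)

# Summarize each sample column by protein
quantTop3 <- sapply(pep[, c("s1", "s2")], function(x) tapply(x, pep$ProteinName, top3Fun))
quantSum  <- sapply(pep[, c("s1", "s2")], function(x) tapply(x, pep$ProteinName, sumFun))
quantTop3
quantSum
```

The 'top3' approach is less sensitive to the number of peptides detected per protein, while 'sum' rewards proteins with many quantified peptides.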
readOpenMSFile(
  fileName = NULL,
  path = NULL,
  normalizeMeth = "median",
  refLi = NULL,
  sampleNames = NULL,
  quantCol = "Intensity",
  sumMeth = "top3",
  minPepNo = 1,
  protNaCol = "ProteinName",
  separateAnnot = TRUE,
  plotGraph = TRUE,
  tit = "OpenMS",
  wex = 1.6,
  specPref = c(conta = "LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method (will be sent to |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
sampleNames |
(character) new column-names for quantification data (by default the names from files with spectra will be used) |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
sumMeth |
(character) method for summarizing peptide data (so far 'top3' and 'sum' available) |
minPepNo |
(integer) minimum number of peptides to be used for returning quantification |
protNaCol |
(character) column name to be read/extracted for the annotation section (default "ProteinName") |
separateAnnot |
(logical) if |
plotGraph |
(logical) optional plot of type vioplot of initial and normalized data (using |
tit |
(character) custom title to plot |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
This function has been developed based on the OpenMS peptide-identification and label-free-quantification module. Csv input files may also be compressed as .gz.
Note: With this version the information about protein-modifications (PTMs) may not yet get exploited fully.
This function returns a list with $raw
(initial/raw abundance values), $quant
with final normalized quantitations, $annot
, $counts
an array with number of peptides, $quantNotes
,$expSetup
and $notes
; or if separateAnnot=FALSE
the function returns a data.frame with annotation and quantitation only
read.table, normalizeThis, readMaxQuantFile, readProlineFile, readProtDiscovFile
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "OpenMS_tiny.csv.gz"
dataOM <- readOpenMSFile(file=fiNa, path=path1, tit="tiny OpenMS example")
summary(dataOM$quant)
Quantification results from Proline and MS-Angel exported in xlsx format can be read directly using this function.
Besides, files in tsv, csv (European and US format) or tabulated txt can be read, too.
Then relevant information gets extracted; the data can optionally be normalized and displayed as boxplot or vioplot.
The final output is a list containing 6 elements: $raw
, $quant
, $annot
, $counts
, $quantNotes
and $notes
.
Alternatively, a data.frame with annotation and quantitation data may be returned if separateAnnot=FALSE
.
Note: There is no normalization by default since quite frequently data produced by Proline are already sufficiently normalized.
The figure produced using the argument plotGraph=TRUE
may help judging if the data appear sufficiently normalized (distributions should align).
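Besides the visual check, the alignment of distributions can be judged numerically by comparing the medians of the samples. The following base-R sketch uses constructed toy data; the idea of looking at the spread of the column medians is only one simple criterion and not something the function reports by itself.

```r
# Constructed toy log2 data: sample s3 is shifted by 1.5 units
base <- seq(15, 25, length.out = 101)
log2Quant <- cbind(s1 = base, s2 = base + 0.05, s3 = base + 1.5)

colMed <- apply(log2Quant, 2, median, na.rm = TRUE)
medSpread <- diff(range(colMed))   # large values indicate poorly aligned samples
medSpread

# After centering each sample at the overall mean of medians the spread vanishes
centered <- sweep(log2Quant, 2, colMed - mean(colMed), FUN = "-")
diff(range(apply(centered, 2, median)))

# For a visual check one may use eg: boxplot(log2Quant, las = 1)
```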
readProlineFile(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  logConvert = TRUE,
  sampleNames = NULL,
  quantCol = "^abundance_",
  annotCol = c("accession", "description", "is_validated", "protein_set_score",
    "X.peptides", "X.specific_peptides"),
  remStrainNo = TRUE,
  pepCountCol = c("^psm_count_", "^peptides_count_"),
  trimColnames = FALSE,
  refLi = NULL,
  separateAnnot = TRUE,
  plotGraph = TRUE,
  titGraph = NULL,
  wex = 2,
  specPref = c(conta = "_conta\\|", mainSpecies = "OS=Homo sapiens"),
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = TRUE,
  groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE),
  silent = FALSE,
  callFrom = NULL,
  debug = FALSE
)
fileName |
(character) name of file to read; .xlsx-, .csv-, .txt- and .tsv can be read (csv, txt and tsv may be gz-compressed). Reading xlsx requires package 'readxl'. |
path |
(character) optional path (note: Windows backslash should be protected or written as '/') |
normalizeMeth |
(character) normalization method (for details and options see |
logConvert |
(logical) convert numeric data as log2, will be placed in $quant |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
quantCol |
(character or integer) columns with main quantitation-data : precise colnames to extract, or if length=1 content of |
annotCol |
(character) precise colnames or if length=1 pattern to search among column-names for $annot |
remStrainNo |
(logical) if |
pepCountCol |
(character) pattern to search among column-names for count data of PSM and NoOfPeptides |
trimColnames |
(logical) optional trimming of column-names of any redundant characters from beginning and end |
refLi |
(integer) custom decide which line of data is main species, if single character entry it will be used to choose a group of species (eg 'mainSpe') |
separateAnnot |
(logical) separate annotation from numeric data (quantCol and annotCol must be defined) |
plotGraph |
(logical or matrix of integer) optional plot vioplot of initial data; if integer, it will be passed to |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by quantification software; however, if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
silent |
(logical) suppress messages |
callFrom |
(character) allow easier tracking of messages produced |
debug |
(logical) display additional messages for debugging |
This function has been developed using Proline version 1.6.1 coupled with MS-Angel 1.6.1. The classical way of using this function consists in exporting results produced by Proline and MS-Angel as an xlsx file. Besides, other formats may be read, too. This includes csv (eg the main sheet/table of the xlsx exported file saved as csv). WOMBAT represents an effort to automate quantitative proteomics experiments; using this route, data get exported as txt files which can be read, too.
This function returns a list with $raw
(initial/raw abundance values), $quant
with final normalized quantitations, $annot
(columns ), $counts
an array with 'PSM' and 'NoOfPeptides', $quantNotes
and $notes
; or a data.frame with quantitation and annotation if separateAnnot=FALSE
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "exampleProlineABC.csv.gz"
dataABC <- readProlineFile(path=path1, file=fiNa)
summary(dataABC$quant)
This function has been deprecated and replaced by readProteomeDiscovererPeptides
(from this package).
readProtDiscovererPeptides(...)
... |
Actually, this function doesn't read any input any more |
This function returns NULL
readProteomeDiscovererFile
, readProteomeDiscovererPeptides
Deprecated old version. Protein identification and quantification results from
Thermo ProteomeDiscoverer
which were exported as tabulated text can be imported and relevant information extracted.
The final output is a list containing 3 elements: $annot
, $raw
and optional $quant
,
or returns data.frame with entire content of file if separateAnnot=FALSE
.
Please use readProteomeDiscovererFile() from the same package instead !
readProtDiscovFile(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  read0asNA = TRUE,
  quantCol = "^Abundances*",
  annotCol = NULL,
  contamCol = "Contaminant",
  refLi = NULL,
  separateAnnot = TRUE,
  FDRCol = list(c("^Protein.FDR.Confidence", "High"), c("^Found.in.Sample.", "High")),
  gr = NULL,
  sdrf = NULL,
  suplAnnotFile = TRUE,
  groupPref = list(lowNumberOfGroups = TRUE),
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  plotGraph = TRUE,
  wex = 1.6,
  titGraph = "Proteome Discoverer",
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) custom column-names for quantification data (ProteomeDiscoverer does not automatically use file-names from spectra); this argument has priority over |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
contamCol |
(character or integer, length=1) which columns should be used for contaminants marked by ProteomeDiscoverer.
If a column named |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) optional indication to search for protein FDR information |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second element may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by ProteomeDiscoverer; however, if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
titGraph |
(character) custom title to plot of distribution of quantitation values |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been replaced by readProteomeDiscovererFile
(from the same package) !
The syntax and structure of the output have remained the same; you can simply replace the name of the function called.
This function has been developed using Thermo ProteomeDiscoverer versions 2.2 to 2.5.
The format of resulting files at export also depends on which columns are chosen as visible inside ProteomeDiscoverer and subsequently get chosen for export.
Using the argument suplAnnotFile
it is possible to specify a specific file (or search for the default file) to read for extracting file-names as sample-names and other experiment-related information.
If a column named contamCol
is found, the data will later be filtered to remove all contaminants; set to NULL
for keeping all contaminants.
This function replaces the deprecated function readPDExport
.
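The filtering of contaminants via `contamCol` can be pictured with the following base-R sketch. The toy annotation table and the encoding of the contaminant flag as text "True"/"False" are assumptions for illustration only; the actual export format and the internal filtering of the function may differ.

```r
# Toy annotation and quantitation (row order matches)
annot <- data.frame(
  Accession = c("P02768", "P00761", "P00330"),
  Contaminant = c("False", "True", "False"),   # assumed encoding of the flag
  stringsAsFactors = FALSE)
quant <- matrix(c(21.3, 25.1, 19.8,
                  21.1, 25.4, 19.9),
                nrow = 3, dimnames = list(annot$Accession, c("s1", "s2")))

contamCol <- "Contaminant"     # set to NULL for keeping all contaminants
if (!is.null(contamCol) && contamCol %in% colnames(annot)) {
  isConta <- toupper(annot[[contamCol]]) == "TRUE"
  annot <- annot[!isConta, , drop = FALSE]
  quant <- quant[!isConta, , drop = FALSE]
}
rownames(quant)
```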
This function returns a list with $raw
(initial/raw abundance values), $quant
with final normalized quantitations, $annot
, $counts
an array with number of peptides, $quantNotes
and $notes
; or if separateAnnot=FALSE
the function returns a data.frame with annotation and quantitation only
read.table, normalizeThis, readMaxQuantFile, readProlineFile, readFragpipeFile
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "tinyPD_allProteins.txt.gz"
## Please use the function readProteomeDiscovererFile(), as shown below (same syntax)
dataPD <- readProteomeDiscovererFile(file=fiNa, path=path1, suplAnnotFile=FALSE)
summary(dataPD$quant)
Deprecated old version. Peptide identification and quantification results from Thermo ProteomeDiscoverer
which were exported as tabulated text can be imported and relevant information extracted.
The final output is a list containing 3 elements: $annot
, $raw
and optional $quant
, or returns data.frame with entire content of file if separateAnnot=FALSE
.
readProtDiscovPeptides(
  fileName,
  path = NULL,
  normalizeMeth = "median",
  sampleNames = NULL,
  suplAnnotFile = TRUE,
  gr = NULL,
  sdrf = NULL,
  read0asNA = TRUE,
  quantCol = "^Abundances*",
  annotCol = NULL,
  contamCol = "Contaminant",
  refLi = NULL,
  separateAnnot = TRUE,
  FDRCol = list(c("^Protein.FDR.Confidence", "High"), c("^Found.in.Sample.", "High")),
  plotGraph = TRUE,
  titGraph = "Proteome Discoverer",
  wex = 1.6,
  specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"),
  silent = FALSE,
  debug = FALSE,
  callFrom = NULL
)
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) new column-names for quantification data (ProteomeDiscoverer does not automatically use file-names from spectra); this argument has priority over |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by ProteomeDiscoverer; however, if |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second element may give further indications for automatic organization of groups of replicates.
Besides, the output from |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
contamCol |
(character or integer, length=1) which columns should be used for contaminants marked by ProteomeDiscoverer.
If a column named |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) optional indication to search for protein FDR information |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
titGraph |
(character) deprecated custom title to plot, please use 'tit' |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been developed using Thermo ProteomeDiscoverer versions 2.2 to 2.5.
The format of resulting files at export also depends on which columns are chosen as visible inside ProteomeDiscoverer and subsequently get chosen for export.
Using the argument suplAnnotFile it is possible to specify a specific file (or search for the default file) to read for extracting file-names as sample-names and other experiment-related information.
Preceding and following amino acids (relative to identified protease recognition sites) will be removed from peptide sequences and displayed in $annot as columns 'prec' and 'foll'.
If a column named as given in contamCol is found, the data will later be filtered to remove all contaminants; set contamCol to NULL to keep all contaminants.
This function replaces the deprecated function readPDExport.
Besides, the ProteomeDiscoverer version number and full raw-file path will be extracted for $notes in the final output.
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts (an array with the number of peptides), $quantNotes and $notes; or, if separateAnnot=FALSE, the function returns a data.frame with annotation and quantitation only.
read.table, normalizeThis, readMaxQuantFile, readProteomeDiscovererFile
path1 <- system.file("extdata", package="wrProteo")
Protein identification and quantification results from Thermo ProteomeDiscoverer which were exported as tabulated text can be imported and relevant information extracted.
readProteomeDiscovererFile( fileName, path = NULL, normalizeMeth = "median", sampleNames = NULL, read0asNA = TRUE, quantCol = "^Abundance", annotCol = NULL, contamCol = "Contaminant", refLi = NULL, separateAnnot = TRUE, FDRCol = list(c("^Protein.FDR.Confidence", "High"), c("^Found.in.Sample.", "High")), gr = NULL, sdrf = NULL, suplAnnotFile = TRUE, groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE), specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"), plotGraph = TRUE, wex = 1.6, titGraph = "Proteome Discoverer", silent = FALSE, debug = FALSE, callFrom = NULL )
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) custom column-names for quantification data (ProteomeDiscoverer does not automatically use file-names from spectra); this argument has priority over |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA |
quantCol |
(character or integer) define which columns should be extracted as quantitation data: the argument may be the exact column-names to be used, or if length=1
content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
contamCol |
(character or integer, length=1) which columns should be used for contaminants marked by ProteomeDiscoverer.
If a column named |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) optional indication to search for protein FDR information |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second & third elements may give further indications for automatic organization of groups of replicates.
Besides, the output from |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by ProteomeDiscoverer; however, if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
titGraph |
(character) custom title to plot of distribution of quantitation values |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been developed using Thermo ProteomeDiscoverer versions 2.2 to 2.5.
The format of resulting files at export also depends on which columns are chosen as visible inside ProteomeDiscoverer and subsequently get chosen for export.
Using the argument suplAnnotFile it is possible to specify a specific file (or search for the default file) to read for extracting file-names as sample-names and other experiment-related information.
If a column named as given in contamCol is found, the data will later be filtered to remove all contaminants; set contamCol to NULL to keep all contaminants.
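The combined effect of quantCol, read0asNA, contamCol and a median normalization can be illustrated on a toy table. This is a minimal base-R sketch under stated assumptions, not the package's implementation: all column names and values are made up, and centering log2 column medians on their mean is only one possible reading of "median" normalization (the package's own procedure may differ).

```r
# Toy export: 4 proteins, 2 samples (all names and values are made up)
tab <- data.frame(
  Accession   = c("P1", "P2", "P3", "P4"),
  Contaminant = c(FALSE, TRUE, FALSE, FALSE),
  Abundance.A = c(100, 50, 0, 400),
  Abundance.B = c(220, 80, 40, 900)
)

# Select quantitation columns by pattern (compare argument quantCol)
quantCols <- grep("^Abundance", colnames(tab))
raw <- as.matrix(tab[, quantCols])
rownames(raw) <- tab$Accession

# Treat 0 as missing (compare read0asNA) so that log2 does not give -Inf
raw[raw == 0] <- NA

# Remove lines flagged as contaminants (compare contamCol)
raw <- raw[!tab$Contaminant, , drop = FALSE]

# Median normalization on log2 scale: shift each column so all medians coincide
logDat <- log2(raw)
colMed <- apply(logDat, 2, median, na.rm = TRUE)
quant  <- sweep(logDat, 2, colMed - mean(colMed))

apply(quant, 2, median, na.rm = TRUE)   # column medians are now equal
```

After these steps the contaminant line (P2) is gone, the former 0 stays NA, and both samples share the same median.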
The final output is a list containing as (main) elements: $annot, $raw and optional $quant; alternatively, a data.frame with the entire content of the file is returned if separateAnnot=FALSE.
This function replaces the deprecated function readProtDiscovFile, which will soon be removed from this package.
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts (an array with the number of peptides), $quantNotes and $notes; or, if separateAnnot=FALSE, the function returns a data.frame with annotation and quantitation only.
read.table, normalizeThis, readMaxQuantFile, readProlineFile, readFragpipeFile
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "tinyPD_allProteins.txt.gz"
dataPD <- readProteomeDiscovererFile(file=fiNa, path=path1, suplAnnotFile=FALSE)
summary(dataPD$quant)
Initial peptide identification and quantification results from Thermo ProteomeDiscoverer which were exported as tabulated text can be imported and relevant information extracted.
The final output is a list containing 3 elements: $annot, $raw and optional $quant; alternatively, a data.frame with the entire content of the file is returned if separateAnnot=FALSE.
readProteomeDiscovererPeptides( fileName, path = NULL, normalizeMeth = "median", sampleNames = NULL, suplAnnotFile = TRUE, gr = NULL, sdrf = NULL, read0asNA = TRUE, quantCol = "^Abundances*", annotCol = NULL, contamCol = "Contaminant", refLi = NULL, separateAnnot = TRUE, FDRCol = list(c("^Protein.FDR.Confidence", "High"), c("^Found.in.Sample.", "High")), plotGraph = TRUE, titGraph = "Proteome Discoverer", wex = 1.6, specPref = c(conta = "CON_|LYSC_CHICK", mainSpecies = "OS=Homo sapiens"), groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE), silent = FALSE, debug = FALSE, callFrom = NULL )
fileName |
(character) name of file to be read |
path |
(character) path of file to be read |
normalizeMeth |
(character) normalization method, defaults to |
sampleNames |
(character) new column-names for quantification data (ProteomeDiscoverer does not automatically use file-names from spectra); this argument has priority over |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by ProteomeDiscoverer; however, if |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data: if character, this may be the ID at ProteomeXchange,
the second element may give further indications for automatic organization of groups of replicates.
Besides, the output from |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
annotCol |
(character) column names to be read/extracted for the annotation section (default c("Accession","Description","Gene","Contaminant","Sum.PEP.Score","Coverage....","X..Peptides","X..PSMs","X..Unique.Peptides", "X..AAs","MW..kDa.") ) |
contamCol |
(character or integer, length=1) which columns should be used for contaminants marked by ProteomeDiscoverer.
If a column named |
refLi |
(character or integer) custom specify which line of data is main species, if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
separateAnnot |
(logical) if |
FDRCol |
(list) optional indication to search for protein FDR information |
plotGraph |
(logical or integer) optional plot of type vioplot of initial and normalized data (using |
titGraph |
(character) deprecated custom title to plot, please use 'tit' |
wex |
(integer) relative expansion factor of the violin-plot (will be passed to |
specPref |
(character or list) define characteristic text for recognizing (main) groups of species (1st for contaminants - will be marked as 'conta', 2nd for main species - marked as 'mainSpe',
and optional following ones for supplemental tags/species - marked as 'species2','species3',...);
if list and list-element has multiple values they will be used for exact matching of accessions (ie 2nd of argument |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function has been developed using Thermo ProteomeDiscoverer versions 2.2 to 2.5.
The format of resulting files at export also depends on which columns are chosen as visible inside ProteomeDiscoverer and subsequently get chosen for export.
Using the argument suplAnnotFile it is possible to specify a specific file (or search for the default file) to read for extracting file-names as sample-names and other experiment-related information.
Preceding and following amino acids (relative to identified protease recognition sites) will be removed from peptide sequences and displayed in $annot as columns 'prec' and 'foll'.
If a column named as given in contamCol is found, the data will later be filtered to remove all contaminants; set contamCol to NULL to keep all contaminants.
This function replaces the deprecated function readPDExport.
Besides, ProteomeDiscoverer version number and full raw-file path will be extracted for $notes in final output.
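The separation of flanking residues into 'prec' and 'foll' can be pictured with a small base-R sketch. It assumes annotated sequences written in the style "[K].PEPTIDER.[A]"; the example sequences, the regular expression and the column 'seq' are illustrative assumptions and not taken from the package code.

```r
# Annotated peptide sequences in the assumed style "[prec].PEPTIDE.[foll]" (made-up examples)
annSeq <- c("[K].AAGELQEDSGLHVLAR.[M]",
            "[R].LVNELTEFAK.[T]",
            "[-].MKWVTFISLLLLFSSAYSR.[G]")

# Three capture groups: preceding residue(s), peptide, following residue(s)
pat <- "^\\[([^]]*)\\]\\.([A-Za-z]+)\\.\\[([^]]*)\\]$"

pepAnnot <- data.frame(
  seq  = sub(pat, "\\2", annSeq),   # peptide without flanking residues
  prec = sub(pat, "\\1", annSeq),   # residue(s) before the cleavage site
  foll = sub(pat, "\\3", annSeq),   # residue(s) after the cleavage site
  stringsAsFactors = FALSE
)
pepAnnot
```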
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot, $counts (an array with the number of peptides), $quantNotes and $notes; or, if separateAnnot=FALSE, the function returns a data.frame with annotation and quantitation only.
read.table, normalizeThis, readMaxQuantFile, readProteomeDiscovererFile
path1 <- system.file("extdata", package="wrProteo")
Sample/experimental annotation meta-data from MaxQuant, ProteomeDiscoverer, FragPipe, Proline or similar can be read using this function and relevant information extracted. Furthermore, annotation in sdrf-format can be added (the order of sdrf will be adjusted automatically, if possible). This function returns a list with grouping of samples into replicates and additional information gathered. Input files compressed as .gz can be read as well.
readSampleMetaData( quantMeth, sdrf = NULL, suplAnnotFile = NULL, path = ".", abund = NULL, groupPref = list(lowNumberOfGroups = TRUE, sampleNames = NULL, gr = NULL), chUnit = TRUE, silent = FALSE, debug = FALSE, callFrom = NULL )
quantMeth |
(character, length=1) quantification method used; 2-letter abbreviations like 'MQ','PD','PL','FP' etc may be used |
sdrf |
(character, list or data.frame) optional extraction and adding of experimental meta-data:
if character, this may be the ID at ProteomeXchange or a similarly formatted local file. |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by MaxQuant; if |
path |
(character) optional path of file(s) to be read |
abund |
(matrix or data.frame) experimental quantitation data; only column-names will be used for aligning order of annotated samples |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates);
May contain |
chUnit |
(logical or character) optional adjusting of group-labels from sample meta-data, in case multiple different unit-prefixes are used, to a single common prefix
(eg adjust '100pMol' and '1nMol' to '100pMol' and '1000pMol') for better downstream analysis. This option will call |
silent |
(logical) suppress messages if |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
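The unit harmonization described for chUnit (eg '100pMol' and '1nMol' becoming '100pMol' and '1000pMol') can be sketched in base R as follows. This is only an illustration under assumptions: the label pattern, the prefix table and the choice of the smallest prefix as common unit are hypothetical and do not reproduce the function called internally.

```r
# Group labels with mixed unit prefixes (made-up examples)
grpLab <- c("100pMol", "1nMol", "500pMol", "2nMol")

# Split each label into numeric value, SI prefix and unit
pat    <- "^([0-9.]+)([fpnum])(Mol)$"
value  <- as.numeric(sub(pat, "\\1", grpLab))
prefix <- sub(pat, "\\2", grpLab)
unit   <- sub(pat, "\\3", grpLab)

# Powers of 10 for a few SI prefixes
expo <- c(f = -15, p = -12, n = -9, u = -6, m = -3)

# Express all labels with the smallest prefix encountered
target   <- names(expo)[expo == min(expo[prefix])]
newValue <- value * 10^(expo[prefix] - expo[target])
newLab   <- paste0(round(newValue, 6), target, unit)   # round() removes floating-point noise
newLab
```

With a single common prefix, group labels can be sorted and compared numerically in downstream analysis.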
When initially reading/importing quantitation data, typically very little is known about the setup of different samples in the underlying experiment. The overall aim is to read and mine the corresponding sample-annotation documented by the quantitation-software and/or from an sdrf repository and to attach it to the experimental data. This way, in subsequent steps of analysis (eg PCA, statistical tests) the user does not have to study the experimental setup to figure out which samples should be considered as replicates of each other.
Sample annotation meta-data can be obtained from two sources: a) from additional files produced (and exported) by the initial quantitation software (so far MaxQuant and ProteomeDiscoverer have been implemented) or b) from the universal sdrf-format (from Pride or user-supplied). Both types can be imported and checked in the same run; if valid sdrf-information is found, this will be given priority. For more information about the sdrf format please see sdrf on github.
This function returns a list with $level (grouping of samples given as integer) and $meth (method by which grouping was determined).
If valid sdrf was given, the resultant list contains in addition $sdrfDat (data.frame of annotation).
Alternatively it may contain a $sdrfExport if sufficient information has been gathered (so far only for MaxQuant) for a draft sdrf for export (that should be revised and completed by the user).
If software annotation has been found it will be shown in $annotBySoft.
If all entries are invalid or entries do not pass the tests, this function returns an empty list.
This function is used internally by readMaxQuantFile, readProteomeDiscovererFile etc; it uses readSdrf for reading sdrf-files and replicateStructure for mining annotation columns
sdrf001819Setup <- readSampleMetaData(quantMeth=NA, sdrf="PXD001819")
str(sdrf001819Setup)
This function allows reading proteomics meta-data from sdrf files, as provided at https://github.com/bigbio/proteomics-sample-metadata. A data.frame containing all annotation data will be returned. To conform to the (non-obligatory) recommendations, column names are shown in lower case.
readSdrf( fi, chCol = "auto", urlPrefix = "github", silent = FALSE, callFrom = NULL, debug = FALSE )
fi |
(character) main input; may be full path or url to the file with meta-annotation. If a short project-name is given,
it will be searched based at the location of |
chCol |
(character, length=1) optional checking of column-names |
urlPrefix |
(character, length=1) prefix to add to search when no complete path or url is given on |
silent |
(logical) suppress messages |
callFrom |
(character) allows easier tracking of messages produced |
debug |
(logical) display additional messages for debugging |
The packages utils and wrMisc must be installed.
Please note that reading sdrf files (if not provided as local copy) will take a few seconds, depending on the responsiveness of github.
This function only handles the main reading of sdrf data and some diagnostic checks.
For mining sdrf data please look at replicateStructure
and readSampleMetaData
.
This function returns the content of sdrf-file as data.frame (or NULL
if the corresponding file was not found)
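What such a data.frame looks like, and how it can be mined for replicate groups, can be sketched with base R on a tiny local file. This is an illustration only and does not call readSdrf: the file content and the chosen characteristics column are made up, and the real function performs additional diagnostic checks.

```r
# Write a tiny sdrf-like table to a temporary file (content is made up)
sdrfFi <- file.path(tempdir(), "toy.sdrf.tsv")
toy <- data.frame(
  "Source Name" = c("s1", "s2", "s3", "s4"),
  "Characteristics[spiked compound]" = c("100 amol", "100 amol", "500 amol", "500 amol"),
  "Comment[data file]" = c("r1.raw", "r2.raw", "r3.raw", "r4.raw"),
  check.names = FALSE
)
write.table(toy, sdrfFi, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back as data.frame and show column names in lower case
sdrf <- read.delim(sdrfFi, check.names = FALSE, stringsAsFactors = FALSE)
colnames(sdrf) <- tolower(colnames(sdrf))
str(sdrf)

# Lines sharing the same characteristics may be treated as replicates
grp <- as.integer(factor(sdrf[["characteristics[spiked compound]"]]))
grp
```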
readSampleMetaData, replicateStructure
## This may take a few seconds...
sdrf001819 <- readSdrf("PXD001819")
str(sdrf001819)
This function allows reading and importing genomic UCSC-annotation data.
Files can be read as default UCSC export or in GTF-format.
In the context of proteomics we noticed that sometimes UniProt tables from UCSC are hard to match to identifiers from UniProt Fasta-files, ie many protein-identifiers won't match.
For this reason additional support is given to reading 'Genes and Gene Predictions': Since this table does not include protein-identifiers, a non-redundant list of ENSxxx transcript identifiers
can be exported as file for an additional step of conversion, eg using a batch conversion tool at the site of UniProt.
The initial genomic annotation can then be complemented using readUniProtExport.
Using this more elaborate route, we found higher coverage when trying to add genomic annotation to protein-identifiers in proteomics results with annotation based on an initial Fasta-file.
readUCSCtable( fiName, exportFileNa = NULL, gtf = NA, simplifyCols = c("gene_id", "chr", "start", "end", "strand", "frame"), silent = FALSE, debug = FALSE, callFrom = NULL )
fiName |
(character) name (and path) of file to read |
exportFileNa |
(character) optional file-name to be exported, if |
gtf |
(logical) specify if file |
simplifyCols |
(character) optional list of column-names to be used for simplification (if 6 column-headers are given): the 1st value will be used to identify the column used as reference to summarize all lines with this ID; for the 2nd (typically chromosome names) a representative value will be taken, for the 3rd (typically gene start site) the minimum, for the 4th (typically gene end site) the maximum, and for the 5th and 6th a representative value will be reported; |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
This function returns a matrix, optionally the file 'exportFileNa' may be written
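The simplification controlled by simplifyCols (one line per identifier, minimum of start, maximum of end, representative values otherwise) can be pictured with the following base-R sketch. It works on a made-up table and returns a data.frame for readability; using the first value as "representative" is an assumption, and the function itself may summarize differently.

```r
# Toy GTF-like table with several lines per gene_id (all values are made up)
gtf <- data.frame(
  gene_id = c("ENST01", "ENST01", "ENST02", "ENST02", "ENST02"),
  chr     = c("chr11", "chr11", "chr11", "chr11", "chr11"),
  start   = c(1000, 1500, 5000, 5200, 5900),
  end     = c(1200, 1800, 5100, 5400, 6300),
  strand  = c("+", "+", "-", "-", "-"),
  frame   = c(".", ".", ".", ".", "."),
  stringsAsFactors = FALSE
)

# One line per gene_id: first value for chr, strand and frame; min of start; max of end
byId  <- split(gtf, gtf$gene_id)
simpl <- do.call(rbind, lapply(byId, function(x) data.frame(
  gene_id = x$gene_id[1], chr = x$chr[1],
  start = min(x$start), end = max(x$end),
  strand = x$strand[1], frame = x$frame[1],
  stringsAsFactors = FALSE)))
simpl
```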
path1 <- system.file("extdata", package="wrProteo")
gtfFi <- file.path(path1, "UCSC_hg38_chr11extr.gtf.gz")
# here we'll write the file for UniProt conversion to tempdir() to keep things tidy
expFi <- file.path(tempdir(), "deUcscForUniProt2.txt")
UcscAnnot1 <- readUCSCtable(gtfFi, exportFileNa=expFi)
## results can be further combined with readUniProtExport()
deUniProtFi <- file.path(path1, "deUniProt_hg38chr11extr.tab")
deUniPr1 <- readUniProtExport(deUniProtFi, deUcsc=UcscAnnot1, targRegion="chr11:1-135,086,622")
deUniPr1[1:5,-5]
This function allows reading and importing protein-ID conversion results from UniProt.
To do so, first copy/paste your query IDs into the UniProt 'Retrieve/ID mapping' field called '1. Provide your identifiers' (or upload as file), then verify '2. Select options'.
In a typical case of 'enst000xxx' IDs you may leave default settings, ie 'Ensembl Transcript' as input and 'UniProt KB' as output. Then, 'Submit' your search and retrieve results via
'Download'; you need to specify a 'Tab-separated' format! If you download as 'Compressed' you need to decompress the .gz file before running the function readUCSCtable.
In addition, a file with UCSC annotation (Ensrnot accessions and chromosomic locations, obtained using readUCSCtable) can be integrated.
readUniProtExport( UniProtFileNa, deUcsc = NULL, targRegion = NULL, useUniPrCol = NULL, silent = FALSE, debug = FALSE, callFrom = NULL )
UniProtFileNa |
(character) name (and path) of file exported from UniProt (tabulated text file including headers) |
deUcsc |
(data.frame) object produced by |
targRegion |
(character or list) optional marking of chromosomal locations to be part of a given chromosomal target region,
may be given as character like |
useUniPrCol |
(character) optional declaration of which columns from the UniProt exported file should be used/imported (default 'EnsID','Entry','Entry.name','Status','Protein.names','Gene.names','Length'). |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
In a typical use case, first chromosomic location annotation is extracted from UCSC for the species of interest and imported to R using readUCSCtable.
However, the tables provided by UCSC don't contain UniProt IDs. Thus, an additional (batch-)conversion step needs to be added.
For this reason readUCSCtable allows writing a file with Ensembl transcript IDs which can be converted to UniProt IDs at the site of UniProt.
Then, UniProt annotation (downloaded as tab-separated) can be imported and combined with the genomic annotation using this function.
This function returns a data.frame (with columns $EnsID, $Entry, $Entry.name, $Status, $Protein.names, $Gene.names, $Length; if deUcsc
is integrated plus: $chr, $type, $start, $end, $score, $strand, $Ensrnot, $avPos)
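How a target region given as text (like "chr11:1-135,086,622") relates to the positional columns can be sketched in base R. The helper parseRegion() and the column 'inTarg' are hypothetical names introduced for this illustration, and the toy annotation is made up; the function's own handling of targRegion may differ.

```r
# Hypothetical helper: split a region given as text into chromosome and positions
parseRegion <- function(x) {
  x <- gsub(",", "", x)                      # drop thousands separators
  parts <- strsplit(x, "[:-]")[[1]]          # split at ':' and '-'
  list(chr = parts[1], pos = as.numeric(parts[2:3]))
}
reg <- parseRegion("chr11:1-135,086,622")

# Toy annotation with chromosome and average position (made up)
annot <- data.frame(
  Entry = c("A", "B", "C"),
  chr   = c("chr11", "chr11", "chr1"),
  avPos = c(5e6, 140e6, 2e6),
  stringsAsFactors = FALSE
)

# Flag entries located inside the target region
annot$inTarg <- annot$chr == reg$chr &
  annot$avPos >= reg$pos[1] & annot$avPos <= reg$pos[2]
annot
```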
path1 <- system.file("extdata",package="wrProteo")
deUniProtFi <- file.path(path1,"deUniProt_hg38chr11extr.tab")
deUniPr1a <- readUniProtExport(deUniProtFi)
str(deUniPr1a)
## Workflow starting with UCSC annotation (gtf) files :
gtfFi <- file.path(path1,"UCSC_hg38_chr11extr.gtf.gz")
UcscAnnot1 <- readUCSCtable(gtfFi)
## Results of conversion at UniProt are already available (file "deUniProt_hg38chr11extr.tab")
myTargRegion <- list("chr1", pos=c(198110001,198570000))
myTargRegion2 <-"chr11:1-135,086,622" # works equally well
deUniPr1 <- readUniProtExport(deUniProtFi,deUcsc=UcscAnnot1, targRegion=myTargRegion)
## Now UniProt IDs and genomic locations are both available :
str(deUniPr1)
Protein quantification results from Wombat-P using the Bioconductor package Normalizer can be read using this function and relevant information extracted. Input files compressed as .gz can be read as well. The protein abundance values (XIC), peptide counting get extracted. Since protein annotation is not very extensive with this format of data, the function allows reading the initial fasta files (from the directory above the quantitation-results) allowing to extract more protein-annotation (like species). Sample-annotation (if available) can be extracted from sdrf files, which are typically part of the Wombat output, too. The protein abundance values may be normalized using multiple methods (median normalization as default), the determination of normalization factors can be restricted to specific proteins (normalization to bait protein(s), or to invariable matrix of spike-in experiments). The protein annotation data gets parsed to extract specific fields (ID, name, description, species ...). Besides, a graphical display of the distribution of protein abundance values may be generated before and after normalization.
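Restricting the determination of normalization factors to specific proteins (eg the invariable matrix of a spike-in experiment) can be illustrated as follows. This is a base-R sketch under assumptions, not the package's implementation: the log2 values are made up, and the tags 'mainSpe' and 'species2' are used here only as example labels for the two groups of proteins.

```r
# Toy log2 abundances: 3 invariable 'matrix' proteins and 2 spike-in proteins (made up)
dat <- matrix(c(10.0, 11.0, 12.0, 5.0, 6.0,
                10.5, 11.5, 12.5, 7.5, 8.5),
  ncol = 2, dimnames = list(c("H1", "H2", "H3", "S1", "S2"), c("sampA", "sampB")))
specType <- c("mainSpe", "mainSpe", "mainSpe", "species2", "species2")

# Use only the lines of the main species to determine the normalization factors
refLi  <- which(specType == "mainSpe")
colMed <- apply(dat[refLi, ], 2, median)

# Apply the resulting shifts to all lines, including the spike-in proteins
normDat <- sweep(dat, 2, colMed - mean(colMed))
normDat
```

In this toy case the matrix proteins become identical between the two samples, while the difference of the spike-in proteins is preserved.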
readWombatNormFile( fileName, path = NULL, quantSoft = "(quant software not specified)", fasta = NULL, isLog2 = TRUE, normalizeMeth = "none", quantCol = "abundance_", contamCol = NULL, pepCountCol = c("number_of_peptides"), read0asNA = TRUE, refLi = NULL, sampleNames = NULL, extrColNames = c("protein_group"), specPref = NULL, remRev = TRUE, remConta = FALSE, separateAnnot = TRUE, gr = NULL, sdrf = NULL, suplAnnotFile = NULL, groupPref = list(lowNumberOfGroups = TRUE, chUnit = TRUE), titGraph = NULL, wex = 1.6, plotGraph = TRUE, silent = FALSE, debug = FALSE, callFrom = NULL )
fileName |
(character) name of file to be read (default 'proteinGroups.txt' as typically generated by Compomics in txt folder). Gz-compressed files can be read, too. |
path |
(character) path of file to be read |
quantSoft |
(character) quantification-software used inside Wombat-P |
fasta |
(logical or character) if |
isLog2 |
(logical) typically data read from Wombat are expected to be |
normalizeMeth |
(character) normalization method, defaults to |
quantCol |
(character or integer) exact col-names, or if length=1 content of |
contamCol |
(character or integer, length=1) which columns should be used for contaminants |
pepCountCol |
(character) pattern to search among column-names for count data (1st entry for 'Razor + unique peptides', 2nd for 'Unique peptides', 3rd for 'MS.MS.count' (PSM)) |
read0asNA |
(logical) decide if initial quantifications at 0 should be transformed to NA (thus avoiding -Inf in log2 results) |
refLi |
(character or integer) custom specify which line of data should be used for normalization, ie which line is main species; if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given |
sampleNames |
(character) custom column-names for quantification data; this argument has priority over |
extrColNames |
(character) column names to be read (1st position: prefix for LFQ quantitation, default 'LFQ.intensity'; 2nd: column name for protein-IDs, default 'Majority.protein.IDs'; 3rd: column names of fasta-headers, default 'Fasta.headers', 4th: column name for number of protein IDs matching, default 'Number.of.proteins') |
specPref |
(character) prefix to identifiers allowing to separate i) recognize contamination database, ii) species of main identifications and iii) spike-in species |
remRev |
(logical) option to remove all protein-identifications based on reverse-peptides |
remConta |
(logical) option to remove all proteins identified as contaminants |
separateAnnot |
(logical) if |
gr |
(character or factor) custom defined pattern of replicate association, will override final grouping of replicates from |
sdrf |
(logical, character, list or data.frame) optional extraction and adding of experimental meta-data:
if |
suplAnnotFile |
(logical or character) optional reading of supplemental files produced by Compomics; if |
groupPref |
(list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to |
titGraph |
(character) custom title to plot of distribution of quantitation values |
wex |
(numeric) relative expansion factor of the violin in plot |
plotGraph |
(logical) optional plot vioplot of initial and normalized data (using |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
In its standard workflow, Wombat-P writes the results of each analysis-method/quantification-algorithm as .csv files. Meta-data describing the proteins may be available from two sources: a) the 1st column of the Wombat/normalizer output; b) the .fasta file in the directory above the analysis/quantification results of the Wombat-workflow.
Meta-data describing the samples and experimental setup may be available from a sdrf-file (from the directory above the analysis/quantification results).
If available, the meta-data will be examined for determining groups of replicates and the results thereof can be found in $sampleSetup$levels.
Alternatively, a dataframe formatted like sdrf-files (ie for each sample a separate line, see also function readSdrf) may be given, too.
This import-function has been developed using Wombat-P version 1.x.
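The effect of argument read0asNA described above can be illustrated with a few lines of base R. This is only a sketch of the idea (zeros are set to NA before the log2-transformation), not the code used inside readWombatNormFile; the small matrix rawAbund is made up for illustration.

```r
## Hypothetical raw (linear-scale) abundances, 3 proteins x 2 samples
rawAbund <- matrix(c(0, 1024, 2048, 512, 0, 4096), ncol=2,
  dimnames=list(paste0("prot", 1:3), c("sampleA", "sampleB")))

## Without precaution, log2(0) gives -Inf
log2Direct <- log2(rawAbund)

## Setting 0 to NA first (the idea behind read0asNA=TRUE) avoids -Inf
rawNA <- rawAbund
rawNA[rawNA == 0] <- NA
log2Clean <- log2(rawNA)
log2Clean
```

NA values obtained this way can later be inspected and imputed, whereas -Inf would distort means and statistical tests.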
The final output is a list containing these elements: $raw, $quant, $annot, $counts, $sampleSetup, $quantNotes, $notes, or (if separateAnnot=FALSE) a data.frame with annotation- and main quantification-content. If sdrf information has been found, an additional list-element setup will be added containing the entire meta-data as setup$meta and the suggested organization as setup$lev.
This function returns a list with $raw (initial/raw abundance values), $quant with final normalized quantitations, $annot (columns ), $counts an array with 'PSM' and 'NoOfRazorPeptides', $quantNotes, $notes and optional setup for meta-data from sdrf; or a data.frame with quantitation and annotation if separateAnnot=FALSE
read.table, normalizeThis, readProteomeDiscovererFile; readProlineFile (and other import-functions), matrixNAinspect
path1 <- system.file("extdata", package="wrProteo")
# Here we'll load a short/trimmed example file (originating from Compomics)
fiNa <- "tinyWombCompo1.csv.gz"
dataWB <- readWombatNormFile(fileName=fiNa, path=path1, titGraph="tiny Wombat/Compomics, Normalized ")
summary(dataWB$quant)
Remove samples (ie columns) from every instance of a list of matrices. Note: This function assumes the same order of columns in all list-elements named in 'listElem'!
removeSampleInList( dat, remSamp, listElem = c("raw", "quant", "counts", "sampleSetup"), silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(list) main input to be filtered |
remSamp |
(integer) column number to exclude |
listElem |
(character) names of list-elements where columns indicated with 'remSamp' should be removed |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
This function returns a list of the same structure as 'dat', with the column(s) indicated by 'remSamp' removed from the list-elements named in 'listElem'
set.seed(2019)
datT6 <- matrix(round(rnorm(300)+3,1), ncol=6, dimnames=list(paste("li",1:50,sep=""), letters[19:24]))
datL <- list(raw=datT6, quant=datT6, annot=matrix(nrow=nrow(datT6), ncol=2))
datDelta2 <- removeSampleInList(datL, remSamp=2)
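The principle of removing the same column from several list-elements can be sketched in base R as follows. The helper dropSampleSketch is hypothetical and written only for illustration; it is not the implementation of removeSampleInList and handles matrices only.

```r
## Hypothetical helper (not part of wrProteo): drop column(s) 'remSamp'
## from each matrix of list 'dat' that is named in 'listElem'
dropSampleSketch <- function(dat, remSamp, listElem=c("raw", "quant", "counts")) {
  for(el in intersect(listElem, names(dat))) {
    if(is.matrix(dat[[el]])) dat[[el]] <- dat[[el]][, -remSamp, drop=FALSE]
  }
  dat
}

set.seed(2019)
datT6 <- matrix(round(rnorm(300)+3, 1), ncol=6,
  dimnames=list(paste0("li", 1:50), letters[19:24]))
datL <- list(raw=datT6, quant=datT6, annot=matrix(nrow=nrow(datT6), ncol=2))
datSk <- dropSampleSketch(datL, remSamp=2)
sapply(datSk, dim)     # 'annot' is not listed in 'listElem' and stays untouched
```

Since the column is addressed by its index, all list-elements concerned must have their columns in the same order, as noted in the description.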
This function helps replacing missing EntryNames (in $annot) after reading quantification results.
To do so, the column-names given in annCol will be used: the content of the 2nd element (and optional 3rd element) will be used to replace missing content in the column defined by the 1st element.
replMissingProtNames( x, annCol = c("EntryName", "Accession", "SpecType"), silent = FALSE, debug = FALSE, callFrom = NULL )
x |
(list) output of |
annCol |
(character) the column-names from |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
This function returns a list (like the input), but with missing elements of $annot completed (if available in other columns)
readMaxQuantFile, readProtDiscovFile, readProlineFile
dat <- list(quant=matrix(sample(11:99,9,replace=TRUE), ncol=3),
  annot=cbind(EntryName=c("YP010_YEAST","",""), Accession=c("A5Z2X5","P01966","P35900"),
    SpecType=c("Yeast",NA,NA)))
replMissingProtNames(dat)
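The replacement logic described above (missing content of the 1st column is filled from the 2nd column) may be sketched in base R as shown below. This is an illustration under simplifying assumptions (only two columns, missing meaning NA or empty text), not the code of replMissingProtNames.

```r
## Hypothetical annotation matrix with two missing EntryNames
annot <- cbind(EntryName=c("YP010_YEAST", "", NA),
  Accession=c("A5Z2X5", "P01966", "P35900"))

## Lines where the 1st column is NA or empty
isMissing <- is.na(annot[, "EntryName"]) | annot[, "EntryName"] == ""
## Fill these lines with the content of the 2nd column
annot[isMissing, "EntryName"] <- annot[isMissing, "Accession"]
annot
```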
Get/convert short names of various proteomics quantitation software names. A 2-letter abbreviation will be returned
shortSoftwName( x, tryAsLower = TRUE, silent = FALSE, debug = FALSE, callFrom = NULL )
x |
(character) name(s) of quantitation software to be abbreviated |
tryAsLower |
(logical) |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allow easier tracking of messages produced |
This function returns a character vector with the 2-letter abbreviation for each software name given in 'x'
shortSoftwName(c("maxquant","DIANN"))
This function takes statistical testing results (obtained using testRobustToNAimputation or moderTest2grp, based on limma) and calculates specificity and sensitivity values for plotting ROC-curves along a panel of thresholds.
Based on annotation (from test$annot) with the user-defined column for species (argument 'spec') the counts of TP (true positives), FP (false positives), FN (false negatives) and TN (true negatives) are determined.
In addition, an optional plot may be produced.
summarizeForROC( test, useComp = 1, tyThr = "BH", thr = NULL, columnTest = NULL, FCthrs = NULL, spec = c("H", "E", "S"), annotCol = "Species", filterMat = "filter", batchMode = FALSE, tit = NULL, color = 1, plotROC = TRUE, pch = 1, bg = NULL, overlPlot = FALSE, silent = FALSE, debug = FALSE, callFrom = NULL )
test |
(list or class |
useComp |
(character or integer) in case of multiple comparisons (ie multiple columns 'test$tyThr'); which pairwise comparison to use |
tyThr |
(character,length=1) type of statistical test-result to be used for sensitivity and specificity calculations (eg 'BH','lfdr' or 'p.value'), must be list-element of 'test' |
thr |
(numeric) stat test (FDR/p-value) threshold, if |
columnTest |
deprecated, please use 'useComp' instead |
FCthrs |
(numeric) Fold-Change threshold (displayed as line), given as Fold-change and NOT as log2(FC), default at 1.5, set to |
spec |
(character) labels for those species which should be matched to column |
annotCol |
(character, length=1) column name of |
filterMat |
(character) name (or index) of element of |
batchMode |
(logical) if |
tit |
(character) optional custom title in graph |
color |
(character or integer) color in graph |
plotROC |
(logical) toggle plot on or off |
pch |
(integer) type of symbol to be used (see |
bg |
(character) background in plot (see |
overlPlot |
(logical) overlay to existing plot if |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
Determining TP and FP counts requires 'ground truth' experiments, where it is known in advance which proteins are expected to change abundance between two groups of samples. Typically this is done by mixing proteins of different species origin; the first species noted by argument 'spec' designates the species to be considered constant (expected as TN in statistical tests). Then, one or multiple additional spike-in species can be defined. As the spike-in concentration should have been altered between different groups of samples, they are expected as TP.
The main aim of this function consists in providing specificity and sensitivity values, plus counts of TP (true positives), FP (false positives), FN (false negatives) and TN (true negatives), along various thresholds (specified in column 'alph') for statistical tests performed prior to calling this function.
Note that the choice of species-annotation plays a crucial role in how the counting results are obtained. In case of multiple spike-in species the user should pay attention to whether they all are expected to change abundance at the same ratio. If not, it is advised to run this function multiple times separately, each time only with the subset of those species expected to change at the same ratio.
The dot on the plotted curve shows the results at the level of the single threshold alpha=0.05.
For plotting multiple ROC curves as overlay and additional graphical parameters/options you may use plotROC.
See also ROC on Wikipedia for explanations of TP, FP, FN and TN as well as examples. Note that numerous other packages also provide support for building and plotting ROC-curves, eg rocPkgShort, ROCR, pROC or ROCit.
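The counting of TP, FP, FN and TN along a panel of thresholds can be sketched in base R as follows. The function rocCounts, the species labels and the simulated FDR values are assumptions made for this illustration; the sketch does not reproduce the full output of summarizeForROC.

```r
## Hypothetical ground-truth data: species 'H' is constant, 'E' is spike-in
set.seed(2019)
species <- rep(c("H", "E"), c(150, 35))
fdr <- c(runif(150), runif(35, 0, 0.01))     # made-up adjusted p-values

## Hypothetical helper: counts and derived values for each threshold
rocCounts <- function(fdr, species, constSpec="H", thr=c(0.01, 0.05, 0.1, 0.5)) {
  isSpike <- species != constSpec
  t(sapply(thr, function(al) {
    pos <- fdr <= al                         # called 'changing' at this threshold
    TP <- sum(pos & isSpike);  FP <- sum(pos & !isSpike)
    FN <- sum(!pos & isSpike); TN <- sum(!pos & !isSpike)
    c(alph=al, TP=TP, FP=FP, FN=FN, TN=TN,
      sens=TP/(TP+FN), spec=TN/(TN+FP))
  }))
}
rocTab <- rocCounts(fdr, species)
rocTab
```

Plotting 1 - spec against sens over a finer grid of thresholds would give the ROC curve.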
This function returns a numeric matrix containing the columns 'alph', 'spec', 'sens', 'prec', 'accur', 'FD' plus two columns with absolute numbers of lines (genes/proteins) passing the current threshold level alpha (1st species, all other species)
replot the figure using plotROC, calculate AUC using AucROC, robust test for preparing tables testRobustToNAimputation, moderTest2grp, test2grp, eBayes in package limma, t.test
set.seed(2019)
test1 <- list(annot=cbind(Species=c(rep("b",35), letters[sample.int(n=3, size=150, replace=TRUE)])),
  BH=matrix(c(runif(35,0,0.01), runif(150)), ncol=1))
tail(roc1 <- summarizeForROC(test1, spec=c("a","b","c"), annotCol="Species"))
test2grp performs a t-test on two groups of data using limma; this is a custom implementation of moderTest2grp for proteomics.
The final object also includes the results without moderation by limma (eg BH-FDR in $nonMod.BH).
Furthermore, there is an option to make use of package ROTS (note, this will increase the time of computations considerably).
test2grp( dat, questNo, useCol = NULL, grp = NULL, annot = NULL, ROTSn = 0, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(matrix or data.frame) main data (may contain NAs) |
questNo |
(integer) specify here which question, ie comparison should be addressed |
useCol |
(integer or character) |
grp |
(character or factor) |
annot |
(matrix or data.frame) |
ROTSn |
(integer) number of iterations ROTS runs (stabilization of results may be seen with >300) |
silent |
(logical) suppress messages |
debug |
(logical) display additional messages for debugging |
callFrom |
(character) allow easier tracking of message(s) produced |
This function returns a limma-type S3 object of class 'MArrayLM' (which can be accessed like a list); multiple testing correction types or modified testing by ROTS may get included ('p.value','FDR','BY','lfdr' or 'ROTS.BH')
moderTest2grp, pVal2lfdr, t.test, ROTS from the Bioconductor package ROTS
set.seed(2018)
datT8 <- matrix(round(rnorm(800)+3,1), ncol=8, dimnames=list(paste(
  "li",1:100,sep=""), paste(rep(LETTERS[1:3],c(3,3,2)),letters[18:25],sep="")))
datT8[3:6,1:2] <- datT8[3:6,1:2] +3   # augment lines 3:6 (c-f)
datT8[5:8,5:6] <- datT8[5:8,5:6] +3   # augment lines 5:8 (e-h)
grp8 <- gl(3,3,labels=LETTERS[1:3],length=8)
datL <- list(data=datT8, filt= wrMisc::presenceFilt(datT8,grp=grp8,maxGrpM=1,ratMa=0.8))
testAvB0 <- wrMisc::moderTest2grp(datT8[,1:6], gl(2,3))
testAvB <- test2grp(datL, questNo=1)
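To illustrate what "results without moderation" refers to, the following base-R sketch runs an ordinary Welch t-test for each line and corrects the p-values by Benjamini-Hochberg. It uses t.test and p.adjust from package stats and simulated data; it is an assumption about the general idea and not the code producing $nonMod.BH in test2grp.

```r
set.seed(2018)
dat <- matrix(rnorm(600) + 3, ncol=6,
  dimnames=list(paste0("li", 1:100), paste0(rep(c("A", "B"), each=3), 1:3)))
dat[3:6, 1:3] <- dat[3:6, 1:3] + 3        # augment lines 3:6 in group A
grp <- gl(2, 3, labels=c("A", "B"))

## Ordinary (non-moderated) t-test for each line
pVal <- apply(dat, 1, function(x) t.test(x[grp == "A"], x[grp == "B"])$p.value)
## Benjamini-Hochberg FDR
BH <- p.adjust(pVal, method="BH")
head(sort(BH))
```

With only 3 replicates per group the line-wise variance estimates are unstable, which is the motivation for the moderated statistics from limma.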
This function replaces NA values based on group neighbours (based on grouping of columns in argument gr), following the overall assumption of a close to Gaussian distribution.
Furthermore, it is assumed that NA-values originate from experimental settings where measurements at or below detection limit are recorded as NA.
In such cases (eg in proteomics) it is current practice to replace NA-values by very low (random) values in order to be able to perform t-tests.
However, random normal values used for replacing may in rare cases deviate from the average (the 'assumed' value) and, in particular if multiple NA replacements are above the average, may look like induced biological data and be misinterpreted as such.
The statistical testing uses eBayes from Bioconductor package limma for robust testing in the context of small numbers of replicates.
By repeating the process of replacing NA-values and subsequent testing multiple times, the results can be summarized afterwards by the median over all repeated runs to remove the stochastic effect of individual NA-imputation.
Thus, one may gain stability towards the random character of NA imputations by repeating imputation & test 'nLoop' times and summarizing p-values by median (results stabilized at 50-100 rounds).
It is necessary to define all groups of replicates in gr to obtain all possible pair-wise testing (multiple columns in $BH, $lfdr etc).
The modified testing-procedure of Bioconductor package ROTS may optionally be included, if desired.
This function returns a limma-like S3 list-object further enriched by additional fields/elements.
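The principle of repeating imputation and testing, then summarizing by the median, can be sketched in base R. In this simplified illustration an ordinary t.test replaces the moderated test from limma, and NA values are drawn from a Normal distribution centered at the lowest observed value; both choices are assumptions made for the sketch and differ from the procedures used by testRobustToNAimputation.

```r
set.seed(2015)
dat <- matrix(rnorm(120, mean=8), ncol=6)     # 20 lines, 2 groups x 3 replicates
dat[1:3, 1:3] <- dat[1:3, 1:3] + 3            # lines 1:3 higher in group A
dat[5, 4:5] <- NA                             # a few NAs, as for low abundance
dat[6, 1] <- NA
grp <- gl(2, 3, labels=c("A", "B"))
nLoop <- 20
isNA <- is.na(dat)

pMat <- sapply(seq_len(nLoop), function(i) {
  datImp <- dat
  ## assumption of this sketch: replacements from a low-abundance Normal
  datImp[isNA] <- rnorm(sum(isNA), mean=min(dat, na.rm=TRUE), sd=0.5)
  apply(datImp, 1, function(x) t.test(x[grp == "A"], x[grp == "B"])$p.value)
})
## summarize all rounds by the median, one value per line
pMed <- apply(pMat, 1, median)
```

Lines without NA give the same p-value in every round; only lines with imputed values vary, and the median dampens the effect of an occasional unusual draw.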
testRobustToNAimputation( dat, gr = NULL, annot = NULL, retnNA = TRUE, avSd = c(0.15, 0.5), avSdH = NULL, plotHist = FALSE, xLab = NULL, tit = NULL, imputMethod = "mode2", seedNo = NULL, multCorMeth = NULL, nLoop = 100, lfdrInclude = NULL, ROTSn = NULL, silent = FALSE, debug = FALSE, callFrom = NULL )
dat |
(matrix or data.frame) main data (may contain |
gr |
(character or factor) replicate association; if |
annot |
(matrix or data.frame) annotation (lines must match lines of data !), if |
retnNA |
(logical) retain and report number of |
avSd |
(numerical,length=2) population characteristics (mean and sd) for >1 |
avSdH |
deprecated, please use |
plotHist |
(logical) additional histogram of original, imputed and resultant distribution (made using |
xLab |
(character) custom x-axis label |
tit |
(character) custom title |
imputMethod |
(character) choose the imputation method (may be 'mode2'(default), 'mode1', 'datQuant', 'modeAdopt', 'informed' or 'none', for details see |
seedNo |
(integer) seed-value for normal random values |
multCorMeth |
(character) define which method(s) for correction of multiple testing should be run (for choice : 'BH','lfdr','BY','tValTab', choosing several is possible) |
nLoop |
(integer) number of runs of independent |
lfdrInclude |
(logical) deprecated, please use |
ROTSn |
(integer) deprecated, please use |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
The argument multCorMeth allows choosing which multiple testing correction algorithms will be used and included in the final results.
Possible options are 'lfdr','BH','BY','tValTab', ROTSn='100' (name to element necessary) or 'noLimma' (to add initial p.values and BH to limma-results). By default 'lfdr' (local false discovery rate from package 'fdrtool') and 'BH' (Benjamini-Hochberg FDR) are chosen.
The option 'BY' refers to Benjamini-Yekutieli FDR, 'tValTab' allows exporting all individual t-values from the repeated NA-substitution and subsequent testing.
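The difference between the 'BH' and 'BY' corrections can be seen with p.adjust from package stats. The p-values below are made up for illustration, and the sketch does not cover 'lfdr', which relies on a separate package.

```r
pVal <- c(0.0004, 0.003, 0.012, 0.04, 0.2, 0.6)   # made-up p-values
adj <- cbind(p.value=pVal,
  BH=p.adjust(pVal, method="BH"),     # Benjamini-Hochberg
  BY=p.adjust(pVal, method="BY"))     # Benjamini-Yekutieli, more conservative
adj
```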
This function is compatible with automatic extraction of experimental setup based on sdrf or other quantitation-specific sample annotation.
In this case, the results of automated importing and mining of sample annotation should be stored as $sampleSetup$groups or $sampleSetup$lev
For details on the choice of NA-imputation procedures with arguments 'imputMethod' and 'avSd' please see matrixNAneighbourImpute.
This function returns a limma-type S3 object of class 'MArrayLM' (which can be accessed like a list); multiple results of testing or multiple testing correction types may get included ('p.value','FDR','BY','lfdr' or 'ROTS.BH')
NA-imputation via matrixNAneighbourImpute, moderated t-test without NA-imputation moderTest2grp, calculating lfdr pVal2lfdr, eBayes in Bioconductor package limma, t.test, ROTS of Bioconductor package ROTS
set.seed(2015)
rand1 <- round(runif(600) +rnorm(600,1,2),3)
dat1 <- matrix(rand1,ncol=6) + matrix(rep((1:100)/20,6),ncol=6)
dat1[13:16,1:3] <- dat1[13:16,1:3] +2      # augment lines 13:16
dat1[19:20,1:3] <- dat1[19:20,1:3] +3      # augment lines 19:20
dat1[15:18,4:6] <- dat1[15:18,4:6] +1.4    # augment lines 15:18
dat1[dat1 <1] <- NA                        # mimic some NAs for low abundance
## normalize data
boxplot(dat1, main="data before normalization")
dat1 <- wrMisc::normalizeThis(as.matrix(dat1), meth="median")
## designate replicate relationships in samples ...
grp1 <- gl(2, 3, labels=LETTERS[1:2])
## moderated t-test with repeated imputations (may take >10 sec, >60 sec if ROTSn >0 !)
PLtestR1 <- testRobustToNAimputation(dat=dat1, gr=grp1, retnNA=TRUE, nLoop=70)
names(PLtestR1)
Please use VolcanoPlotW() from package wrGraph. This function does NOT produce a plot any more.
VolcanoPlotW2( Mvalue, pValue = NULL, useComp = 1, filtFin = NULL, ProjNa = NULL, FCthrs = NULL, FdrList = NULL, FdrThrs = NULL, FdrType = NULL, subTxt = NULL, grayIncrem = TRUE, col = NULL, pch = 16, compNa = NULL, batchFig = FALSE, cexMa = 1.8, cexLa = 1.1, limM = NULL, limp = NULL, annotColumn = NULL, annColor = NULL, cexPt = NULL, cexSub = NULL, cexTxLab = 0.7, namesNBest = NULL, NbestCol = 1, sortLeg = "descend", NaSpecTypeAsContam = TRUE, useMar = c(6.2, 4, 4, 2), returnData = FALSE, callFrom = NULL, silent = FALSE, debug = FALSE )
Mvalue |
(numeric or matrix) data to plot; M-values are typically calculated as difference of log2-abundance values and 'pValue' the mean of log2-abundance values;
M-values and p-values may be given as 2 columns of a matrix, in this case the argument |
pValue |
(numeric, list or data.frame) if |
useComp |
(integer, length=1) choice of which of multiple comparisons to present in |
filtFin |
(matrix or logical) The data may get filtered before plotting: If |
ProjNa |
(character) custom title |
FCthrs |
(numeric) Fold-Change threshold (displayed as line), given as Fold-change and NOT log2(FC), default at 1.5, set to |
FdrList |
(numeric) FDR data or name of list-element |
FdrThrs |
(numeric) FDR threshold (display as line), default at 0.05, set to |
FdrType |
(character) FDR-type to extract if |
subTxt |
(character) custom sub-title |
grayIncrem |
(logical) if |
col |
(character) custom color(s) for points of plot (see also |
pch |
(integer) type of symbol(s) to plot (default=16) (see also |
compNa |
(character) names of groups compared |
batchFig |
(logical) if |
cexMa |
(numeric) font-size of title, as expansion factor (see also |
cexLa |
(numeric) size of axis-labels, as expansion factor (see also |
limM |
(numeric, length=2) range of axis M-values |
limp |
(numeric, length=2) range of axis FDR / p-values |
annotColumn |
(character) column names of annotation to be extracted (only if |
annColor |
(character or integer) colors for specific groups of annotation (only if |
cexPt |
(numeric) size of points, as expansion factor (see also |
cexSub |
(numeric) size of subtitle, as expansion factor (see also |
cexTxLab |
(numeric) size of text-labels for points, as expansion factor (see also |
namesNBest |
(integer or character) number of best points to add names in figure; if 'passThr' all points passing FDR and FC-filters will be selected;
if the initial object |
NbestCol |
(character or integer) colors for text-labels of best points |
sortLeg |
(character) sorting of 'SpecType' annotation either ascending ('ascend') or descending ('descend'), no sorting if |
NaSpecTypeAsContam |
(logical) consider lines/proteins with |
useMar |
(numeric,length=4) custom margins (see also |
returnData |
(logical) optional returning data.frame with (ID, Mvalue, pValue, FDRvalue, passFilt) |
callFrom |
(character) allow easier tracking of message(s) produced |
silent |
(logical) suppress messages |
debug |
(logical) additional messages for debugging |
deprecated - returns nothing
(this function was replaced by plotPCAw)
set.seed(2005); mat <- matrix(round(runif(900),2), ncol=9)
Write sequences in fasta format to file
This function writes sequences from a character vector as a fasta formatted file (as from UniProt).
Line-headers are based on the names of the elements of input vector prot.
This function also allows comparing the main vector of sequences with a reference vector ref to check if any of the sequences therein are truncated.
writeFasta2( prot, fileNa = NULL, ref = NULL, lineLength = 60, eol = "\n", truSuf = "_tru", silent = FALSE, debug = FALSE, callFrom = NULL )
prot |
(character) vector of sequences, names will be used for fasta-header |
fileNa |
(character) name (and path) for file to be written |
ref |
(character) optional/additional set of (reference-) sequences (only for comparison to |
lineLength |
(integer, length=1) number of sequence characters per line (default 60, should be >1 and <10000) |
eol |
(character) the character(s) to print at the end of each line (row); for example, eol = "\r\n" will produce Windows' line endings on a Unix-alike OS |
truSuf |
(character) suffix to be added for sequences found truncated when comparing with |
silent |
(logical) suppress messages |
debug |
(logical) supplemental messages for debugging |
callFrom |
(character) allows easier tracking of messages produced |
Sequences without any names will be given generic headers like protein01 ... etc.
This function writes the sequences from prot as fasta formatted-file
readFasta2 for reading fasta, write.fasta from the package seqinr
prots <- c(SEQU1="ABCDEFGHIJKL", SEQU2="CDEFGHIJKLMNOP")
writeFasta2(prots, fileNa=file.path(tempdir(),"testWrite.fasta"), lineLength=6)
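The layout produced with a given lineLength (one header line starting with '>' per sequence, followed by the sequence cut into chunks) can be sketched in base R. The helper fastaLinesSketch is hypothetical and only builds the lines of text; it is not the implementation of writeFasta2 and does not perform the comparison to ref.

```r
## Hypothetical helper (not part of wrProteo): header plus wrapped sequence lines
fastaLinesSketch <- function(prot, lineLength=60) {
  unlist(lapply(seq_along(prot), function(i) {
    sq <- prot[[i]]
    sta <- seq(1, nchar(sq), by=lineLength)       # start position of each chunk
    c(paste0(">", names(prot)[i]),
      substring(sq, sta, pmin(sta + lineLength - 1, nchar(sq))))
  }), use.names=FALSE)
}

prots <- c(SEQU1="ABCDEFGHIJKL", SEQU2="CDEFGHIJKLMNOP")
fastaLines <- fastaLinesSketch(prots, lineLength=6)
fastaLines
```

Such a vector of lines could then be written with writeLines, whose argument sep plays a role comparable to eol.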