
Question: Should We Remove Duplicated Reads in RNA-Seq?
5.7 years ago by
Medical Research Council, Oxford, United Kingdom
200 wrote:
Is it OK to remove duplicated reads in RNA-seq before testing for differential expression?
modified 4.6 years ago
5.7 years ago by
How do you define a duplicated read? If you delete duplicate reads, how will you get information on expression levels?
5.7 years ago by
A duplicated read is not the same as a duplicated transcript. The level of natural read duplication will be a function of coverage and expression levels. And since, most of the time, the latter is not known, we can only estimate the natural duplication rate.
modified 5.7 years ago
5.7 years ago by
Reads which are exactly the same and on the same strand. They get aligned to the same place and look like multiple copies of one read. So, I assume the answer is NO?
5.7 years ago by
5.7 years ago by
Washington University, St Louis, USA
16k wrote:
The general consensus seems to be to NOT remove duplicates from RNA-seq. Read the biostar discussions:
and the other threads it links to.
modified 5.7 years ago
5.7 years ago by
The general consensus I see from those discussions, rather than "no", is "it depends". The first link: Malachi Griffith writes: "Observing high rates of read duplicates in RNA-seq libraries is common. It may not be an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems but it is often because of very high abundance of a small number of genes." Second link: Istvan Albert writes: "My personal opinion is to investigate the duplication rates and remove them if there is indication that these are artificial ones." 3rd link:
Ketil writes: "I think the number of duplicates depend on many factors, so it is hard to give any general and useful rules of thumb. Usually, duplicates are correlated with too little sample material, and/or difficulties in the lab." And the seqanswers thread goes both ways.
5.7 years ago by
Yes. I should have said that the general consensus seems to be to not blindly remove duplicates.
5.7 years ago by
But are type I/II errors from duplication rates expression/quartile dependent?
Coefficient of Variation is often the lowest in the most abundant genes, which are also the genes with the highest probability of duplicates (optical, amplification, or true duplicates) by virtue of the amount of data they consume. Variance is often artificially inflated for these genes in the end anyway in most regularization protocols for differential expression, so the conclusions from DE analyses should be largely the same... In contrast, genes with low expression stand to be the most affected by duplicates for this exact same reason (i.e. their variance is artificially reduced by regularization AND duplicates are a larger proportion of their count totals to begin with, thus increasing type I error). The odds of these arguments affecting opinions are low, but I think it's a conversation worth having.
I would recommend re-running differential expression analyses with/without deduplication and then comparing the resulting gene sets for each quartile of expression. My educated guess is that conclusions for the top quartile would be fairly consistent (due to the low CoV), while results could be quite different in genes with low expression, where a larger percentage of the counts may be optical or technical duplicates (in comparison to genes in the top quartile).
Edit: rephrasing and grammar
modified 15 months ago
15 months ago by
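The with/without-deduplication comparison proposed above can be sketched as follows. Everything here is simulated and hypothetical (the gene counts, DE calls, and disagreement rates are invented purely to show the quartile bookkeeping, not to make a biological claim):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: per-gene mean expression, plus boolean DE calls from
# two runs of the same pipeline (with vs. without deduplication).
n_genes = 4000
mean_expr = rng.lognormal(mean=2.0, sigma=1.5, size=n_genes)
de_with_dup = rng.random(n_genes) < 0.10
# Simulate mild disagreement, concentrated in low-expression genes.
flip_prob = np.where(mean_expr < np.median(mean_expr), 0.05, 0.005)
de_dedup = de_with_dup ^ (rng.random(n_genes) < flip_prob)

# Agreement of DE calls within each expression quartile.
quartile = np.searchsorted(np.quantile(mean_expr, [0.25, 0.5, 0.75]), mean_expr)
agreement = {}
for q in range(4):
    mask = quartile == q
    agreement[q] = float(np.mean(de_with_dup[mask] == de_dedup[mask]))
    print(f"quartile {q + 1}: {agreement[q]:.1%} of DE calls agree")
```

With real data you would replace the simulated calls with the significant-gene indicators from each analysis; the per-quartile agreement (or a Jaccard overlap of the gene sets) then shows where deduplication actually changes conclusions.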
This is an interesting comment and I'd like to understand it better. Are you using the word "regularization" in the statistical sense, for example as in "Tikhonov regularization"? If so, can you describe the statistical methods you have in mind? Also, how do you know that there are more duplicates per read for low-expressed genes?
11 months ago by
5.7 years ago by
State College, PA, USA
3.0k wrote:
It really depends. I have seen alignments with so many identical reads at a certain position (5 and 3 prime matching exactly) that they were surely PCR duplicates (at other positions in the alignment there was no such coverage). However, you might lose information about strongly expressed genes this way. I would check for un/equal coverage for every gene/contig.
modified 5.7 years ago
5.7 years ago by
5.7 years ago by
4.6k wrote:
If you have paired-end reads, I definitely think you should remove duplicates (alignments that start at the same positions for both read 1 and read 2). These are very unlikely to occur by chance because of the variation in fragment size.
If you have a small amount of RNA going into the experiment, you will have run a lot of PCR cycles before sequencing and the representation of some fragments will have become very biased. Duplicate removal is a way to mitigate this effect although it will not solve it.
It continues to baffle me why people keep saying that you have to keep duplicates because you will lose information - but apparently it's perfectly fine to get grossly distorted read counts because of amplification artifacts! There is no way to avoid bias completely.
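The paired-end criterion above (identical alignment start positions for read 1 and read 2) can be sketched as a toy duplicate-marker. This is illustrative only, with invented record fields; real tools (e.g. Picard MarkDuplicates) also handle strand, clipping, and mate orientation properly:

```python
from collections import defaultdict

# Toy sketch: group read pairs by an identity signature and keep one
# representative per group; the rest are candidate PCR duplicates.
def mark_duplicates(pairs):
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["start1"], pair["start2"], pair["strand"])
        groups[key].append(pair)
    keep, dupes = [], []
    for group in groups.values():
        keep.append(group[0])    # one representative per signature
        dupes.extend(group[1:])  # candidate PCR duplicates
    return keep, dupes

pairs = [
    {"chrom": "chr1", "start1": 100, "start2": 350, "strand": "+"},
    {"chrom": "chr1", "start1": 100, "start2": 350, "strand": "+"},  # duplicate
    {"chrom": "chr1", "start1": 100, "start2": 420, "strand": "+"},  # different fragment
]
keep, dupes = mark_duplicates(pairs)
print(len(keep), len(dupes))  # prints: 2 1
```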
5.7 years ago by
The problem is that your first premise, that "These are very unlikely to occur by chance because of the variation in fragment size", is true for DNA but NOT necessarily true for RNA. If you have a short RNA (i.e., a small transcript, miRNA, etc.) that is very highly expressed, there might be many, many legitimate duplicate copies of that RNA with exactly the same fragment size/position. The size of that transcript might even be such that effectively no fragmentation occurs. Also, fragmentation may not occur in a totally random fashion across the transcript. By removing those duplicates you may be introducing a new source of "grossly distorted read counts". As you say, there is no way to avoid bias completely. But, depending on the complexity and quality of your library, you may be better off living with a small amount of amplification artifacts.
modified 5.7 years ago
5.7 years ago by
OK, but for miRNA etc. you wouldn't do paired-end sequencing; what I had in mind was the typical mRNA case. Anyway, your point is noted, and it again underlines that one has to think about the biases, and not just blindly follow a rule such as "always remove duplicates" or "never remove duplicates".
5.7 years ago by
5.7 years ago by
670 wrote:
Short answer - NO!
You are checking for differential expression of genes, and the number of reads mapping to a gene is a measure of its expression. By removing reads, you will bias the true expression measurements.
5.7 years ago by
I don't agree at all. By not removing duplicates, you bias the expression measurements towards PCR artifacts, especially from paired-end data.
5.7 years ago by
Well, I have not worked with paired-end data so far, so I can't say anything about them. But I produced 'tag' files to track the number of copies of 'duplicate reads' for single-end seq. data. The very highly expressed genes tend to have more of such 'duplicate' reads. We also did validations with qPCR that confirm the higher expression. So I don't think you can say duplicate reads result purely from PCR amplification artifacts.
5.7 years ago by
OK, that's an interesting finding actually. Sure, duplicate reads are not always from PCR amplification artifacts, but sometimes they are! Ideally, you should probably decide on a case-by-case basis.
5.7 years ago by
But highly expressed genes will have a higher number of reads in general - I would be very interested in the ratios (dupl/all) for highly/medium/low expressed genes, if you could provide this information please.
modified 4.8 years ago
4.8 years ago by
4.6 years ago by
United States
200 wrote:
The eXpress pipeline (Berkeley) does this for you.
It calculates a coverage distribution and then removes the "spikes" from that distribution.
Because duplicates are expected in RNA Seq, removing them "blindly" is a bad idea.
You should be smoothing, not removing.
4.6 years ago by
How much of an issue are read duplicates in RNA-seq experiments, both single-fragment and PE sequenced? Are they affecting a large proportion of the data? What are the duplicate rates like for an average human sample experiment? On the order of 1% duplicates? Or 10% duplicates?
3.0 years ago by
The rate of natural duplication of the reads coming from a transcript will be a function of the length of the transcript, the expression level of the transcript, and the overall coverage of the sample. Hence there is no rule that you could use a priori to separate natural duplicates (that should be kept) from artificial duplicates (that should be removed).
3.0 years ago by
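The dependence on transcript length and coverage described above can be illustrated with a small model. This is my own sketch, assuming reads start uniformly at random along the transcript (real fragmentation is not uniform, as noted elsewhere in this thread):

```python
# Toy model (my own, assuming uniform random start positions): expected
# fraction of reads that are "natural" duplicates when n reads land on a
# transcript offering L possible start sites.
def expected_duplicate_fraction(n_reads, n_start_sites):
    # Expected number of distinct start positions among n uniform draws.
    expected_unique = n_start_sites * (1 - (1 - 1 / n_start_sites) ** n_reads)
    return 1 - expected_unique / n_reads

# Short, highly covered transcripts are dominated by natural duplicates;
# long or lightly covered ones much less so.
for n_start_sites in (200, 2000, 20000):
    frac = expected_duplicate_fraction(10_000, n_start_sites)
    print(f"{n_start_sites:>6} start sites, 10000 reads: {frac:.1%} duplicates")
```

At 10,000 reads, a 200-site transcript is almost entirely "natural duplicates" while a 20,000-site one is mostly unique, even before any PCR artifacts.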
I have seen genome-wide duplication rates up to 50% in our own data (50bp SE) and think this is nothing unusual.
3.0 years ago by
In recent years, RNA-Seq has been widely applied, and there are several ways to quantify gene expression levels, such as RPKM, FPKM, and TPM. How are they defined, and how do they relate to one another? Below is a video and some material collected from the web for reference.

It used to be that when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Kilobase Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I'd use a StatQuest to clear everything up.

These three metrics attempt to normalize for sequencing depth and gene length. Here's how you do it for RPKM:

1. Count up the total reads in a sample and divide that number by 1,000,000 - this is our "per million" scaling factor.
2. Divide the read counts by the "per million" scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponds to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn't count this fragment twice).

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here's how you calculate TPM:

1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your "per million" scaling factor.
3. Divide the RPK values by the "per million" scaling factor. This gives you TPM.

So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second.
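The two recipes above can be sketched in a few lines (the counts and gene lengths here are invented; the function names are my own):

```python
import numpy as np

# Toy data (invented): read counts for three genes in two samples,
# and gene lengths in kilobases.
lengths_kb = np.array([1.5, 3.0, 0.5])
sample1 = np.array([500.0, 300.0, 200.0])
sample2 = np.array([50.0, 800.0, 150.0])

def rpkm(counts):
    per_million = counts.sum() / 1e6  # step 1: "per million" scaling factor
    rpm = counts / per_million        # step 2: depth normalization (RPM)
    return rpm / lengths_kb           # step 3: length normalization

def tpm(counts):
    rpk = counts / lengths_kb         # step 1: length normalization first (RPK)
    per_million = rpk.sum() / 1e6     # step 2: "per million" factor from RPK
    return rpk / per_million          # step 3: depth normalization

# TPM sums to the same total (one million) in every sample; RPKM does not.
print(tpm(sample1).sum(), tpm(sample2).sum())
print(rpkm(sample1).sum(), rpkm(sample2).sum())
```

Note that the TPM totals are identical across samples while the RPKM totals are not, which is the key practical consequence of swapping the order of the two normalizations.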
However, the effects of this difference are quite profound. When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.

Here's an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in Sample 2 is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always adds up to the same number (so the denominator required to calculate the proportions is the same, regardless of which sample you are looking at).

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know whether the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.

English source: http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between-sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

Preliminaries

Throughout this post, "read" refers to both single-end and paired-end reads. The concept of counting is the same with either type of read, as each read represents a fragment that was sequenced.

When saying "feature", I'm referring to an expression feature, by which I mean a genomic region containing a sequence that can normally appear in an RNA-Seq experiment (e.g. gene, isoform, exon).

Finally, I use the random variable X_i to denote the counts you observe from a feature of interest i.
Unfortunately, with alternative splicing you do not directly observe X_i, so often E[X_i] is used, which is estimated using the EM algorithm by a method like eXpress, RSEM, Sailfish, Cufflinks, or one of many other tools.

Counts

"Counts" usually refers to the number of reads that align to a particular feature. I'll refer to counts by the random variable X_i. These numbers are heavily dependent on two things: (1) the number of fragments you sequenced (this is related to relative abundances) and (2) the length of the feature, or, more appropriately, the effective length. The effective length refers to the number of possible start sites at which a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as:

L_i = l_i - μ_FLD + 1

where μ_FLD is the mean of the fragment length distribution, which was learned from the aligned reads. If the abundance estimation method you're using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.

Since counts are NOT scaled by the length of the feature, all units in this category are not comparable within a sample without adjusting for the feature length. This means you can't sum the counts over a set of features to get the expression of that set (e.g. you can't sum isoform counts to get gene counts).

Counts are often used by differential expression methods since they are naturally represented by a counting model, such as a negative binomial (NB2).

Effective counts

When eXpress came out, it began reporting "effective counts." These are basically the same thing as standard counts, with the difference that they are adjusted for the amount of bias in the experiment.
To compute effective counts:

effCounts_i = X_i · l_i / L_i

The intuition here is that if the effective length is much shorter than the actual length, then in an experiment with no bias you would expect to see more counts. Thus, the effective counts scale the observed counts up.

Counts per million

Counts per million (CPM) mapped reads are counts scaled by the number of fragments you sequenced (N) times one million:

CPM_i = X_i / N × 10^6

I'm not sure where this unit first appeared, but I've seen it used with edgeR and talked about briefly in the limma voom paper.

Within-sample normalization

As noted in the counts section, the number of fragments you see from a feature depends on its length. Therefore, in order to compare features of different lengths you should normalize counts by the length of the feature. Doing so allows the summation of expression across features to get the expression of a group of features (think of a set of transcripts which make up a gene).

Again, the methods in this section allow for comparison of features with different lengths WITHIN a sample but not BETWEEN samples.

TPM

Transcripts per million (TPM) is a measurement of the proportion of transcripts in your pool of RNA.

Since we are interested in taking the length into consideration, a natural measurement is the rate, counts per base (X_i / L_i). As you might immediately notice, this number is also dependent on the total number of fragments sequenced. To adjust for this, simply divide by the sum of all rates; this gives the proportion of transcripts of type i in your sample. After you compute that, you simply scale by one million because the proportion is often very small and a pain to deal with. In math:

TPM_i = (X_i / L_i) / Σ_j (X_j / L_j) × 10^6

TPM has a very nice interpretation when you're looking at transcript abundances.
As the name suggests, the interpretation is that if you were to sequence one million full-length transcripts, TPM is the number of transcripts you would have seen of type i, given the abundances of the other transcripts in your sample. The last "given" part is important. The denominator is going to be different between experiments, and thus is also sample dependent, which is why you cannot directly compare TPM between samples. While this is true, TPM is probably the most stable unit across experiments, though you still shouldn't compare it across experiments.

I'm fairly certain TPM is attributed to Bo Li et al. in the original RSEM paper.

RPKM/FPKM

Reads per kilobase of exon per million reads mapped (RPKM), or the more generic FPKM (substitute reads with fragments), are essentially the same thing. Contrary to some misconceptions, FPKM is not 2 × RPKM if you have paired-end reads. FPKM == RPKM if you have single-end reads, and saying RPKM when you have paired-end reads is just weird, so don't do it.

A few years ago when the Mortazavi et al. paper came out and introduced RPKM, I remember many people referring to the method they used to compute expression (termed the "rescue method") as RPKM. This also happened with the Cufflinks method. People would say things like, "We used the RPKM method to compute expression" when they meant to say they used the rescue method or the Cufflinks method. I'm happy to report that I haven't heard this as much recently, but I still hear it every now and then. Therefore, let's clear one thing up: FPKM is NOT a method, it is simply a unit of expression.

FPKM takes the same rate we discussed in the TPM section and, instead of dividing it by the sum of rates, divides it by the total number of reads sequenced (N) and multiplies by a big number (10^9).
In math:

FPKM_i = X_i / (L_i · N) × 10^9

The interpretation of FPKM is as follows: if you were to sequence this pool of RNA again, you would expect to see FPKM_i fragments for each thousand bases in the feature for every million fragments you sequenced. It's basically just the rate of fragments per base multiplied by a big number (proportional to the number of fragments you sequenced) to make it more convenient.

Relationship between TPM and FPKM

The relationship between TPM and FPKM is derived by Lior Pachter in a review of transcript quantification methods, in equations 10-13. I'll recite it here: if you have FPKM, you can easily compute TPM:

TPM_i = FPKM_i / Σ_j FPKM_j × 10^6

Wagner et al. discuss some of the benefits of TPM over FPKM and advocate the use of TPM.

I hope this clears up some confusion or helps you see the relationship between these units. In the near future I plan to write about how to use sequencing-depth normalization with these different units so you can compare several samples to each other.
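The units defined above (effective length, effective counts, CPM, TPM, and FPKM) can be pulled together in a minimal sketch, using invented counts and lengths:

```python
import numpy as np

# Invented example: observed counts X_i and actual lengths l_i for four
# features, with an assumed mean fragment length of 200.
counts = np.array([1200.0, 300.0, 5000.0, 80.0])     # X_i
lengths = np.array([2000.0, 1150.0, 3400.0, 600.0])  # l_i (bases)
mu_fld = 200.0
eff_lengths = lengths - mu_fld + 1        # L_i = l_i - mu_FLD + 1
n_total = counts.sum()                    # N (toy value)

eff_counts = counts * lengths / eff_lengths  # effCounts_i = X_i * l_i / L_i
cpm = counts / n_total * 1e6                 # CPM_i = X_i / N * 10^6

rate = counts / eff_lengths                  # counts per base, X_i / L_i
tpm = rate / rate.sum() * 1e6                # TPM_i
fpkm = counts / (eff_lengths * n_total) * 1e9  # FPKM_i

# The FPKM -> TPM conversion recovers the same TPM values.
tpm_from_fpkm = fpkm / fpkm.sum() * 1e6
print(np.allclose(tpm, tpm_from_fpkm), round(tpm.sum()))  # prints: True 1000000
```

The final check confirms the relationship recited above: dividing each FPKM by the sum of FPKMs and scaling by one million recovers TPM exactly.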