I'm having a hard time implementing a calculation of the edit distance from a sam record. For a bwa mem alignment, I would resort to taking the NM tag (explicitly defined as the edit distance), but this tag is not set by all aligners, so that's where the trouble starts... Alternatively, I would parse the MD tag to get the number of matches and subtract that from the aligned length. But conceptually that doesn't seem equivalent to the MD tag. I scale both to the **aligned** read length. I should clarify this is for analysis of Oxford Nanopore data, which is dominated by indel errors. I'll add my python code for calculating both "definitions" of the edit distances. I use pysam, but it's more about the logic than the actual code. Using the NM tag: read.get_tag("NM")/read.query_alignment_length So I: 1. take the NM tag which should be the edit distance 2. divide it by the length of the alignment. This excludes clipped sequences from the calculation. Using the MD tag: (read.query_alignment_length - sum([int(item) for item in re.split('[ACTG^]', read.get_tag("MD")) if not item == '']))/read.query_alignment_length So I: 1. Take the MD tag and sum all sequence 'matches' by removing all non-matching `ACTG^` characters from the string (is this conceptually correct?) 2. I subtract the number of matches from the aligned read length 3. I divide this by the length of the alignment. This excludes clipped sequences from the calculation. To me, both calculations s ...
Hello members, The following statement is taken from this paper http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-160 **we should not distinguish between a k-mer and its reversed complement, and by the “canonical k-mer” we will mean the lexicographically smaller of the two.** Can anybody explain the statement in bold with an example? I know what is the meaning of k-mer and counting k-mers in reads data set. But i'm not able to understand the above statement. Thanks in advance
Most of the times, when I try to open a help page in R studio, by hitting tab when start writing the name of a function and then do F1, the page doesn't open and throws the error: "There is no package...", although that package is installed and I load it with library("package"). For example, to open the help page for "resize()", after calling library(IRanges), it throws the error: Error in find.package(if (is.null(package)) loadedNamespaces() else package, : there is no package called 'package:GenomicRanges' Then I write: library(GenomicRanges) I try again and then new error shows up: Error in find.package(if (is.null(package)) loadedNamespaces() else package, : there is no package called 'package:GenomicRanges' Is it possible to solve this error?
When we look at the filters of ensembl in biomart package in R, it shows two options for the microarray platform hg_u133a: affy_hg_u133a: Affy hg u133a probeset ID(s) [e.g. 200874_s_at] with_affy_hg_u133a: with Affymetrix Microarray hg u133a probeset ID(s) the same happens for other microarrays. The code to get filters is shown bellow: library(biomaRt) ensembl = useMart(biomart ="ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl") filters = listFilters(ensembl) Does someone know what is the difference between affy_hg_u133a and with_affy_hg_u133a?
Hey together, I have done several GenePanel Sequencing runs on the miSeq and in the analysis I get a value for the "Varaint Predicition Index". Can anyone explain me, what this index tell me? I used UMI´s in my libraries. (molecular tags) Thanks!
Hi I am mapping pair end strand-specific libraryof RNAseq to the genome. Should I use option --no-discordant in hisat2 to avoid discordant alignment? Does it affect genome guided assembly? Thanks
Helo, In VCF file, there GT/PL folumn for genotype and its likelihood values. If 2 allele are possible (reference allele and alternative allele) the column value would be like below: 0/1:56:0:80 The 56 score is correspond to reference homozygous, 0 is to heterezygous, and 80 is to alternative homozygous. My question is, if there are more than 2 allele (let's say 0 for reference and 1,2 for alternate allele), the score will consist of 6 score which is corresponds to: 1. reference homozygous (0/0) 2. alt 1 homozygous (1/1) 3. alt 2 homozygous (2/2) 4. ref and alt 1 heterzygous (0/1) 5. ref and alt 2 heterozygous (1/2) 6. alt 1 and alt 2 heterozygous (2/2) My question is what is the order in the actual VCF file? I just don't know the order of the score and its corresponding meaning. Below is the actual example of 1 line in my vcf data. 1 226548932 . ACGGCGGCGGCGGCGGCGGCGGTGGCGGCGGCGG ACGGCGGCGGCGGTGGCGGCGGCGG,ACGGCGGCGGCGGCGGCGGTGGCGGCGGCGG 39.049 . INDEL;IDV=1;IMF=1;DP=9;VDB=0.0225004;SGB=-1.15236;MQSB=0.900802;MQ0F=0;ICB=0.153846;HOB=0.0555556;AC=1,1;AN=12;DP4=4,2,1,1;MQ=60 GT:PL ./.:0,0,0,0,0,0 0/0:0,3,60,3,60,60 0/0:0,3,60,3,60,60 ./.:0,0,0,0,0,0 0/1:60,3,0,60,3,60 0/0:0,3,60,3,60,60 0/0:0,3,60,3,60,60 0/2:50,56,132,0,81,78 Look at the GT/PL list below (I have 8 samples): 1. Sample 1 : ./.:0,0,0,0,0,0 2. Sample 2 : 0/0:0,3,60,3,60,60 3. Sample 3 : 0/0:0,3,60,3,60,60 4. Sample 4 : ./.:0,0,0,0,0,0 5. Sample 5 : 0/1:60,3,0,60,3,60 6. Sample 6 : 0/0:0,3,60,3,60,60 7. S ...
I have VCFs from a number of different tumor samples sourced mostly from two different variant callers that I'm trying to merge onto one sheet: GATK (via an Illumina MiSeq machine) and LoFreq. The end goal is to upload the merged VCF to CRAVAT for easier visualization/interpretation of variant information. I'm struggling to figure out how to merge variants, though---neither vcf-merge nor CombineVariants have been able to do this successfully. Does anyone have any experience with merging these VCFs generated by these specific variant callers?
I have almost 300s Genes list (only name). I want to find tag SNPs on each gene Is there any tools to get tag SNPs easily I found below sites but it had to query one by one. https://snpinfo.niehs.nih.gov/snpinfo/snptag.html Thank you so much for your help!
Hello, I want to ask about consensus sequence generated from variant data. Let's say I have a region like below: ACATGACGATACTAACGGAACC From that region, I found 2 SNP on the 3rd and 10th nucleotide like below: POS -- REF -- ALT 3 -- A -- C 10 -- T -- A My question is, if I want to apply the consensus function, there are 2 possible sequence: 1. heterozygous sequences with 1 sequence only 1 mutation on 3rd nucleotide AND 1 sequence mutated on 10th nucleotide 2. heterozygous sequences with 1 sequence is similar to reference and other sequence consist of both mutation on 3rd and 10th nucelotides. My question is, how to decide which is the best represntative of the consensus sequence?
hi I have a peaks interval output from MACS2.1 which is attached below. How do I find the gene names using the start and end positions, whilst maintaining the fold change values in Galaxy in my file attached below? I have followed other posts on the forum such as [this one] but this does return only gene names without the expression values(fold change, p-value, etc) http://tinypic.com/view.php?pic=1zb4sw&s=9 : https://www.biostars.org/p/120382/
Let's say you're searching through a VCF with germline variants. You find a variant that's present in only one person. In addition, this person is homozygous for that variant. The chances of finding a variant like this by random chance are pretty small. Only a handful of variants should be like this. Instead, you find over a thousand. Some are close to each other on the same person. Also, the average person in a study population tends to only have a few hundred or thousand unique variants - about the same as the number of single homozygotes. How likely is it that this variant call is the result of an early somatic SDSA repair vs. something else? How can I find out if these "unique" spots are double-stranded-break/recombination hotspots? I'm just looking for opinions. Any input is welcome.
Hi. A recently published paper in [PLOS medicine] reported that they had developed a polygenic hazard score (PHS) for quantifying individual differences in age-specific genetic risk for Alzheimer disease (AD) with hazard ratios of disease associated 33 SNPs. Later in that paper, they predicted the age of AD onset using this PHS, but I couldn't find how they calculated it due to the lack of knowledge in statistics. Can someone show me how to use the hazard ratios to predict the onset age? Sorry for poor English. Thank you! : http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002258
Hello, I'm trying to run a STAR alignment on some RNA-Seq reads but my sam output file is much smaller than expected to the point where I believe it didn't work correctly. I'm not sure what I'm doing wrong. This is the command I used STAR --runThreadN 4 --genomeDir /Path/To/Index --readFilesCommand gunzip -c /Path/to/Reads/File Please let me know if there is any other information I can provide. Any help would be appreciated, Thank you in advance!
Hi. I'm trying to download TCGA DNA methylation data using 'GDCdownload' in 'TCGAbiolinks', but I'm getting this error message. > GDCdownload(query) GDCdownload will download 5201 files. A total of 110.697216048 GB Downloading as: Tue_Apr_11_09_28_55_2017.tar.gz |========================================================================================================| 100% /bin/tar: This does not look like a tar archive gzip: stdin: not in gzip format /bin/tar: Child returned status 1 /bin/tar: Error is not recoverable: exiting now Download completed Can anyone explain what went wrong? Thank you!
I'm trying to get a sub-sequence with only 4 fold, 3 fold ,etc degenerate codons. I found this part of the biopython cookbook that has some functions that could do this [link]. However the code keeps referencing an input to the functions just called "table". I haven't been able for the life of me to figure out that sort of table I should be inputting. Example: degenerated_subseq(seq, x, table): seq is the sequence, x the level of degeneracy. What is table? Does anyone have any ideas? Thanks for your help! Chris [link]: http://biopython.org/wiki/Degenerated_Codons
Is there a way that I can perform blast through NCBI from my program? Did NCBI provide such as API to do it?
It's fairly trivial to check the significance of overlap between 2 gene lists with a Fisher's exact test in R (`fisher.test()`), which is widely accepted. Is there a good alternative that would also be able to incorporate ranks or weights for each gene?