mutation t@sting
QueryEngine Documentation
Input
Modifiable HTML elements are highlighted in blue.
VCF file
Input files have to be in VCF format, and
coordinates must refer to GRCh37 (also called hg19). At present, we do not offer processing of (merged) VCF files containing variants obtained
from sequencing of two or more samples. Thus, the uploaded VCF file may only contain data from one sample.
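If you are unsure whether a VCF file is a single-sample file, you can check it before uploading. The following is a minimal sketch and not part of the QueryEngine; it assumes a standard tab-separated VCF header line starting with #CHROM, in which the first nine columns (CHROM to FORMAT) are fixed and every further column is one sample. The function name count_samples is made up for this example.

```python
def count_samples(vcf_lines):
    """Return the number of sample columns declared in the #CHROM header line."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, all further columns are samples.
            return max(0, len(columns) - 9)
    raise ValueError("no #CHROM header line found")


# Illustrative header lines (not taken from a real data set).
single = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n",
]
merged = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2\n",
]
print(count_samples(single))  # one sample column
print(count_samples(merged))  # two sample columns: not suitable for upload
```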
E-mail address
Has to be provided so that you can be notified when your MutationTaster results are ready. The results can be browsed
on our server for three weeks and will be deleted afterwards.
HTML files
If you want to download all the individual MutationTaster results in HTML format once the QueryEngine has finished your analysis,
you have to choose create HTML files. The whole QueryEngine run then takes longer because
writing, storing and zipping the HTML files takes some time.
If you are only interested in downloading the summarized results in TSV format, choose
don't create HTML files. The
QueryEngine will then run much faster. You can still view the detailed HTML results online on our server because
we provide direct links to query MutationTaster again on demand for every variant. We assume that you will probably not look at
every single variant from your exome sequencing project (or similar), which is why we think it is better to re-query the
most interesting variants afterwards instead of storing thousands of HTML files in advance.
Analysis settings
search for homozygous variants
Check yes if you are interested in MutationTaster results for homozygous variants only - heterozygous variants
will then be ignored. If unchecked, all variants in your VCF will be processed (unless other options are checked).
search for compound heterozygous variants
Check yes if you are interested in MutationTaster results for compound heterozygous variants.
If unchecked, all variants in your VCF will be processed (unless other options are checked).
This option is not yet implemented, but will be soon.
combine neighbouring variants
Sometimes single base exchanges are located very close to each other. If considered separately as single alterations, they might seem harmless,
but if they act together, they might be deleterious. For this reason we offer the option to combine neighbouring variants (only single base exchanges) and to treat them as if they were
one, more complex, alteration. Check yes if you are interested in this. The analysis of the combined
alterations is conducted in addition to the analysis of the single alterations.
This option is in beta status. Please let us know if you encounter any problems or inconsistencies. Thank you!
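To illustrate the idea of combining neighbouring single base exchanges, here is a minimal sketch of how nearby variants could be grouped. It is not the QueryEngine's implementation: the function name and the max_gap parameter are hypothetical, and this documentation does not state how close two variants must be in order to be combined.

```python
def group_neighbouring_snvs(snvs, max_gap=2):
    """Group (chrom, pos, ref, alt) single base exchanges that lie close together.

    max_gap is a hypothetical distance threshold in bases; the real value used
    by the QueryEngine is not documented here. Only groups with two or more
    members are returned.
    """
    groups = []
    current = []
    for snv in sorted(snvs, key=lambda v: (v[0], v[1])):
        same_chrom = bool(current) and snv[0] == current[-1][0]
        if same_chrom and snv[1] - current[-1][1] <= max_gap:
            current.append(snv)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [snv]
    if len(current) > 1:
        groups.append(current)
    return groups


# Illustrative variants (made up for this example).
snvs = [
    ("1", 100, "A", "G"),
    ("1", 101, "C", "T"),
    ("1", 500, "G", "A"),
    ("2", 101, "T", "C"),
]
for group in group_neighbouring_snvs(snvs):
    print(group)
```

Here only the two variants at positions 100 and 101 on chromosome 1 form a group; the other two remain single alterations.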
analyse complete VCF / variants on chr / analyse custom regions / exclude custom regions
If you don't need your complete VCF file to be analysed, you can save time by restricting the analysis to certain regions
(for example linkage regions or homozygous regions). Choose analyse custom regions (a text field
will open) and enter your regions of interest in BED format. Some people are interested in variants all over the genome,
but mostly in exonic ones. They can leave the option analyse complete VCF selected, but
use a ready-made set of all suitable Ensembl69 exons for analysis by additionally ticking the
...but only exons option. Since many people are also interested in intronic variants
that are close to exons, you can enter a value between 0 and 99 - this is the number of "flanking" bases
adjacent to intron/exon borders which are additionally analysed.
The ...but only exons option is also available if you are interested in all variants on a
certain chromosome (choose analyse variants on chr and enter the chromosome) but
again want to exclude intronic ones. You can also exclude certain regions with the
exclude custom regions option.
This option is in beta status. Please let us know if you encounter any problems or inconsistencies. Thank you!
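The region options described above can be pictured with a small sketch. It is only an illustration, not the QueryEngine's code: the function names are made up, and it assumes that chromosome names are written the same way in the BED regions and in the VCF. BED start coordinates are 0-based and end coordinates are exclusive, whereas VCF positions are 1-based, so the position is shifted by one before comparison. The flank argument mirrors the idea of the "flanking" bases offered with the exon option.

```python
def parse_bed(text):
    """Parse BED lines into (chrom, start, end); start is 0-based, end is exclusive."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions, flank=0):
    """True if the 1-based VCF position lies in any region widened by flank bases."""
    zero_based = pos - 1
    return any(
        c == chrom and start - flank <= zero_based < end + flank
        for c, start, end in regions
    )


# Illustrative regions (made up for this example).
regions = parse_bed("chr1\t1000\t2000\nchr2\t500\t600\n")
print(in_regions("chr1", 1001, regions))           # first base of the region
print(in_regions("chr1", 1000, regions))           # one base before the region
print(in_regions("chr1", 1000, regions, flank=5))  # inside once 5 flanking bases are added
```

An "analyse custom regions" filter would keep variants for which in_regions is true, while an "exclude custom regions" filter would keep those for which it is false.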
filter against 1000G
Here, you may specify filter options to skip the analysis of those of your variants that were also found in the
1000Genomes Project (1000G, also abbreviated TGP).
If you wish to exclude variants found four or more times in homozygous state in 1000G but include all heterozygous variants,
you can leave everything as it is (default setting). You are free to change the number of cases that have to be present in 1000G in order to exclude variants from
analysis, and you may additionally filter out variants found in heterozygous state in 1000G. For this purpose, check the checkbox and adjust the number in
the corresponding text field (heterozygous in ... or more 1000G samples). Filtering of heterozygous variants is not turned on by default. If you do not want to filter
against 1000G at all, uncheck both boxes. Once a box is checked, a numerical value must be entered. This can also be zero (0), which is effectively the same as
unchecking the box. Checking the checkbox and leaving the text field empty will result in an error message.
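The filter rules above can be summarized in a short sketch. This is an illustration of the described behaviour, not the QueryEngine's code; the function and parameter names are made up. An unchecked box is modelled as None, and a threshold of 0 is treated like an unchecked box, as stated above.

```python
def skip_for_1000g(hom_count, het_count, hom_threshold=4, het_threshold=None):
    """Return True if a variant would be skipped because of its 1000G counts.

    hom_count / het_count: number of 1000G samples carrying the variant in
    homozygous / heterozygous state. A threshold of None (box unchecked) or 0
    disables the corresponding filter.
    """
    if hom_threshold and hom_count >= hom_threshold:
        return True
    if het_threshold and het_count >= het_threshold:
        return True
    return False


# Default setting: homozygous in 4 or more samples, heterozygous filter off.
print(skip_for_1000g(hom_count=4, het_count=100))  # skipped
print(skip_for_1000g(hom_count=3, het_count=100))  # analysed
# Heterozygous filter switched on with a threshold of 5.
print(skip_for_1000g(hom_count=0, het_count=10, het_threshold=5))  # skipped
```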
minimum coverage
Positions with very low coverage do not offer reliable data. Therefore, it is useful to exclude such variants from analysis (if not already done during
variant calling / pileup). We offer the possibility to skip variants whose coverage is below a user-defined threshold. To do so, adjust the number
in the corresponding text field. If you don't want to exclude poorly covered variants, fill in 0.
The default minimum coverage is 4.
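If you prefer to pre-filter your VCF by coverage yourself, a check along the following lines could be used. This is a minimal sketch under one assumption: that the coverage is stored as a DP entry in the INFO column. This documentation does not specify which VCF field the QueryEngine reads, and the function names are made up.

```python
def coverage_from_info(info):
    """Return the DP value from a VCF INFO string, or None if there is none."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP" and value.isdigit():
            return int(value)
    return None


def passes_coverage(info, min_coverage=4):
    """True if the variant is covered at least min_coverage times.

    Variants without a DP entry are kept; that choice is an assumption made
    for this sketch.
    """
    depth = coverage_from_info(info)
    return depth is None or depth >= min_coverage


print(passes_coverage("DP=3;AF=0.5"))          # below the default threshold of 4
print(passes_coverage("AF=0.5;DP=4"))          # exactly at the threshold
print(passes_coverage("DP=0", min_coverage=0)) # a threshold of 0 keeps everything
```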
queue status
We display the current load of the QueryEngine. Jobs are automatically sorted into different queues depending on their nature and size.
Submitted jobs are generally sorted into either the small (VCF files containing 1-500 variants / lines), medium (VCF files containing
501-10,000 variants / lines) or large (VCF files containing more than 10,000 variants / lines) queue. The different queues are executed
with different priorities and independently of each other. Even when the queue for large jobs is full, a small job will be processed immediately
if there are free slots in the small queue.
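The size limits of the three queues can be written down as a small function. This is only a restatement of the limits given above; the function name is made up, and the sketch ignores the "nature" of a job, which is not described in detail here.

```python
def queue_for(n_variants):
    """Return the queue name for a VCF file with n_variants variants / lines."""
    if n_variants < 1:
        raise ValueError("a job needs at least one variant")
    if n_variants <= 500:
        return "small"
    if n_variants <= 10_000:
        return "medium"
    return "large"


for n in (1, 500, 501, 10_000, 10_001):
    print(n, queue_for(n))
```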
Database (DB) jobs are automatically generated during a QueryEngine run and placed in a separate queue, since they may put a heavy load on our server.
DB queries are executed several times during every QueryEngine run, independent of the size of the submitted VCF file.
Output
Statistics
Most often, MutationTaster will not analyse each and every line of your VCF file, either because you have set certain filters,
or because certain variants were not suitable for analysis with MutationTaster.
submitted variants - Number of alterations (lines) in VCF file.
pre-discarded variants - Number of variants which were filtered out according to user input (below coverage,
not homozygous, out of specified region / chromosome) or due to input / format errors (e.g. variant equals refseq,
reference allele equals alternative allele, Indel is too long or neither genotype nor frequency is supplied). All
pre-discarded variants are written to a file (skipped.txt) which can be downloaded on the results page as soon as your job
has been finished.
analysable variants - Number of variants which were suitable for analysis. This number can be significantly higher than
the number of lines in the VCF, because one line in the VCF sometimes contains more than one alternative allele. If you
choose to combine neighbouring variants, the number will rise further.
discarded (TGP) - Number of variants ignored for analysis due to presence in 1000 Genomes Project (applies only
if one or both of the two filter against TGP options are set).
discarded (out of gene/exon/region) - Number of variants which were excluded from analysis because they are
a) extragenic and/or b) out of/distant from exon (applies only if option for only exons is set) or c) out of
chromosome (applies only if option for only chromosome CHR is set) or d) out of region (applies only if option for
analyse custom region is set) or e) inside region (applies only if option for exclude custom region is set).
analysed variants - Number of variants which were analysed with MutationTaster. These will normally be
significantly more than the analysable variants, because for most variants, more than one (suitable) transcript will be found.
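The reason why there can be more analysable variants than VCF lines is easiest to see in a small sketch. It assumes the standard VCF convention that several alternative alleles on one line are separated by commas in the ALT column; the function name is made up, and the sketch does not reproduce the other checks listed above (e.g. Indel length).

```python
def expand_alleles(vcf_line):
    """Split one VCF data line into one (chrom, pos, ref, alt) tuple per ALT allele."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return [(chrom, int(pos), ref, allele) for allele in alt.split(",")]


# Illustrative VCF lines (made up for this example).
lines = [
    "1\t100\t.\tA\tG,T\t50\tPASS\t.\n",
    "1\t200\t.\tC\tT\t50\tPASS\t.\n",
]
variants = [variant for line in lines for variant in expand_alleles(line)]
print(len(lines), "lines ->", len(variants), "variants")
```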
Storage and download of results
MutationTaster results are stored in our database and can be accessed online on our server. At present,
results are not deleted, but as soon as the QueryEngine is made public, we will store your results only
for three weeks. Afterwards, they will automatically be deleted. You can download your results as a zip archive.
We offer two download possibilities:
a) download results as archive of single HTML files (only recommended for input VCFs with few variants) - the resulting archive contains all the MutationTaster
result files as single HTML files. Since they are (at present) neither divided into sub-folders nor summarized in one overview
HTML file, this zip archive gets bulky when many variants were processed. That's why we don't recommend it for large input VCFs.
Moreover, please be sure to activate the 'create HTML files' option before submitting your VCF (otherwise we will not store the
HTML result files and you cannot download them).
b) download results summarized with main features as TSV file(s) (generally recommended, especially for large input VCFs) -
the resulting archive contains one TSV file with one variant per line and the following columns per variant:
chromosome | position | genesymbol | pred_index | model | probability | alt_type | AAE | snp_id | allele_ref | allele_alt | f_ClinVar
We generally recommend downloading your results as TSV for two main reasons: 1) The QueryEngine will run much faster if no HTML
files have to be created and saved and 2) the resulting TSV file can be filtered and/or sorted both before and after
downloading. You can filter out certain variants (e.g. those that were excluded due to presence in TGP) and sort the
remaining variants according to user-specified criteria (see Display / filter / export results). Once the file is
downloaded and stored on your own machine, you can still re-sort it with Microsoft Excel or similar
spreadsheet programs.
Especially when large VCF files have to be analysed (e.g. from exome sequencing), it is very likely that you won't look at each
and every single HTML result file, but only at some HTML files fulfilling certain criteria (e.g. prediction disease causing or
variants in certain genes). You have all the results in the TSV file and can then query the interesting variants manually.
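Picking the interesting variants from a downloaded TSV file can also be scripted. The following minimal sketch selects all variants in a given set of genes. It assumes that the TSV file is tab-separated and starts with a header line containing the column names listed above; the function name, the gene symbols and the example rows are made up, and only a few of the columns are shown.

```python
import csv
import io


def variants_in_genes(tsv_text, genes):
    """Return all rows (as dicts) whose genesymbol is in the given collection."""
    wanted = set(genes)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["genesymbol"] in wanted]


# Illustrative content with a reduced set of columns (values are made up).
tsv_text = (
    "chromosome\tposition\tgenesymbol\tprobability\n"
    "1\t100\tGENE_A\t0.99\n"
    "2\t200\tGENE_B\t0.75\n"
    "3\t300\tGENE_A\t0.60\n"
)
for row in variants_in_genes(tsv_text, ["GENE_A"]):
    print(row["chromosome"], row["position"], row["genesymbol"])
```

For a real file, the text can be read with open(path, newline="") and passed to csv.DictReader directly instead of going through io.StringIO.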
The option to delete your data as soon as your download is completed will soon be added.
Display / filter / export results
The results stored in our database can be sorted and filtered by different criteria for either displaying and browsing them
directly on our server or for exporting them.
sort & group
1) sort & group by prediction | model | gene symbol; choose this option for sorting from prediction disease causing to prediction polymorphism,
from complex_aae via simple_aae to without_aae model, from gene symbols starting with A to gene symbols starting with Z
2) sort & group by prediction | model | gene symbol | variation; similar to 1) but additional level of grouping according
to the variation.
3) sort by these attributes; choose this option if you want to sort & group by customized criteria in one, two or three levels. The
different criteria are:
genesymbol ASC (genesymbol from A to Z)
genesymbol DESC (genesymbol from Z to A)
chromosome ASC (chromosome from 1 to Y)
chromosome DESC (chromosome from Y to 1)
position ASC (ascending)
position DESC (descending)
pred_index ASC (prediction from disease causing to polymorphism)
pred_index DESC (prediction from polymorphism to disease causing)
pred_problem ASC (reason for prediction problem, from A to Z)
pred_problem DESC (reason for prediction problem, from Z to A)
model ASC (model used by the classifier, from without_aae via simple_aae to complex_aae)
model DESC (model used by the classifier, from complex_aae via simple_aae to without_aae)
probability ASC (probability of the prediction ascending)
probability DESC (probability of the prediction descending)
alt_type ASC (alteration type in order single base exchange, insertion and deletion, insertion, deletion)
alt_type DESC (alteration type in order deletion, insertion, insertion and deletion, single base exchange)
snp_id ASC (rs-number ascending)
snp_id DESC (rs-number descending)
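Sorting by up to three levels with a separate ASC or DESC direction per level can be reproduced on a downloaded TSV file. The sketch below is one way to do it and not the QueryEngine's implementation; the function name and the example rows are made up. It relies on the fact that Python's sort is stable, so sorting by the last criterion first and by the first criterion last yields the desired grouping.

```python
def sort_rows(rows, criteria):
    """Sort rows (dicts) by a list of (column, "ASC" or "DESC"), primary criterion first."""
    result = list(rows)
    # Apply the least important criterion first; stability keeps earlier order on ties.
    for column, direction in reversed(criteria):
        result.sort(key=lambda row: row[column], reverse=(direction == "DESC"))
    return result


# Illustrative rows (made up for this example).
rows = [
    {"genesymbol": "B", "position": 5},
    {"genesymbol": "A", "position": 1},
    {"genesymbol": "A", "position": 9},
]
for row in sort_rows(rows, [("genesymbol", "ASC"), ("position", "DESC")]):
    print(row["genesymbol"], row["position"])
```

Note that values read from a TSV file are strings, so numeric columns such as position or probability should be converted to numbers first, and the chromosome order from 1 to Y needs its own sort key.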
hide
The following options are available to hide certain alterations: silent alterations (i.e. without amino acid exchange), all
predicted polymorphisms, known polymorphisms (i.e. homozygous > 4 times in 1000Genomes Project) and prediction problems.
The selected options apply both to displaying results in the browser and to downloading them as TSV.
get the data
The results can either be displayed online in your browser (choose display) or be downloaded
as TSV (choose export as TSV). Filtering and sorting options are applied to both methods.
General comments on MutationTaster output
Please note: The option to show nucleotide alignment (multi-species alignment of the nucleotide sequence around the submitted
alteration) in the MutationTaster results is turned off by default in the QueryEngine. This is mainly for reasons of speed,
since the BLAST call slows down MutationTaster and the results are not used by the Bayes Classifier anyway.
show nucleotide alignment is turned on by default if you use the link to
re-query single variants in MutationTaster which is provided in the results table on our server.
The QueryEngine will process the variants from the submitted VCF file in all suitable Ensembl69 transcripts.
Some transcripts will not be included in the analysis, e.g. transcripts which a) have no or too many corresponding NCBI gene
ID(s), b) are protein-coding but have no correct start codon (ATG) or stop codon (TGA, TAA, TAG). MutationTaster first tries
to use protein-coding transcripts, and if there is at least one, it won't search for transcripts of other biotypes. Only if
there are no protein-coding transcripts available will it try to use transcripts of other biotypes (although certain biotypes
are excluded from analysis as a matter of principle, e.g. nonsense_mediated_decay, ambiguous_orf, TR_pseudogene etc.).
Please see the MutationTaster documentation for details of the MutationTaster analysis results
and MutationTaster error messages.
Known bugs and limitations
- only VCF files containing data from one sample can be processed
Future plans
- analysis of merged VCF files with data from multiple samples or several single-sample VCF files at once
- filter options to exclude / include variants present / absent in defined samples
Contact
In case you discover bugs, have suggestions or questions, please write an e-mail to
Jana Marie Schwarz (jana-marie.schwarz AT charite.de) or to
Dominik Seelow (dominik.seelow AT charite.de).
We also appreciate hearing about your general experiences using this QueryEngine and MutationTaster.