PSEQ data output
This page describes some of the different formats that PSEQ can export to. In contrast to the various views of genetic data and meta-data that are available, these options are designed to produce machine-readable, rather than more human-readable, output. Run pseq help output for any other current options.
- VCF: writing to Variant Call Format with write-vcf
- VARDB: making a new VARDB based on filtered, consensus calls with write-vardb
- PLINK: create PLINK fileset with write-ped
- v-matrix: output genotype data as a simple rectangular matrix of alternate allele counts
- g-matrix: similar to v-matrix, but with data grouped by gene/group
- meta-matrix: write variant meta-information as a simple rectangular matrix
- v-meta-matrix: similar to meta-matrix, but for genotype meta-information
- BEAGLE: genotype likelihoods with write-lik
Output to a VCF file
To write (to stdout) project data in VCF format, use the command:
pseq /path/to/my/project write-vcf
This can be combined with a mask, for example to write a VCF that only lists singleton variants in the target region, excluding variants that fail a filter called GTAKStandard, and only showing the tags DP and AN, for example:
pseq /path/to/my/project write-vcf --mask loc=targets filter.ex=GATKStandard mac=1-1 --show DP AN > my.new.vcf
To output a file in BGZF-compressed format (such that can be indexed by the index-vcf command), add the flag:
--format BGZF --file filename
Thus, to directly convert a plain-text VCF to a BGZF-compressed one:
pseq my.vcf write-vcf --format BGZF --file my.vcf.gz
which will create a new BGZF compressed file my.vcf.gz.
Output to a new variant database (VARDB)
To create a new VARDB, use the command:
pseq proj1 write-vardb --new-project proj2 --new-vardb proj_out/vardb2
This creates a new project, with the project specification file at proj2, and the new VARDB is named vardb2 (but in the same folder as the original VARDB, in this case). This can be combined with a mask command to restrict the amount of information (individuals, variants, samples, meta-fields, etc) that is present in the new project. Typically, PLINK/Seq assumes the model of filtering on-the-fly rather than recreating new projects/files to represent different stages of filtering/QC, but there can be many instances where it is useful to create a new project. In particular, only consensus genotype and variant information is stored in the new VARDB, and multiple files are merged into a single file in the new project. For example, considering the project created in the tutorial:
pseq proj var-summary
---Variant DB summary--- 4449 unique variants File tag : CEU (3489 variants, 90 individuals) File tag : TSI (3281 variants, 66 individuals)
If we create a new project, in this example only for chromosome 7:
pseq proj write-vardb --new-project proj2 --new-vardb vardb2 --mask reg=chr7
pseq proj2 var-summary
---Variant DB summary--- 186 unique variants File tag : 1 (186 variants, 156 individuals)
That is, the 186 variants on chromosome 7 from CEU and TSI samples are now combined in a single sample in a new VARDB, that can be accessed with commands such as:
pseq proj2 v-stats
Note: sample-specific variant meta-information is not imported into the new VARDB when more than a single sample is selected from the original project.
Output to a PED-formatted file
To create a file that can be read by PLINK, use the command:
pseq /path/to/my/project write-ped --name mydata
will create two files
that can be read by PLINK with the command
plink --tfile mydata
In order to generate a valid PLINK fileset, multi-allelic markers are set to a missing genotype value; phase and ploidy information is lost also. This command can be combined with a mask via the --mask option. To use the family and individual IDs in the TFAM file (i.e. FID and IID instead of the standard PLINK/SEQ ID and a missing value), add the following:
Output genotype data as a rectangular matrix
To write a file that contains the number of non-reference alleles for each genotype in a rectangular format (that can be easily loaded into R, for example), use the command:
pseq /path/to/my/project v-matrix
The format is a header row:
VAR REF ALT ID1 ID2...
followed by one line per variant. Null genotypes are represented with the NA code. This command can be combined with a mask via the --mask option.
Output genotype data aggregated by gene/group as a rectangular matrix
To create a rectangular text file that scores each individual for each gene, or set of variants:
pseq /path/to/my/project g-matrix --mask loc.group=CCDS
Some grouping option must be specified in the mask (i.e. loc.group). The format is a header row is:
GENE NV ID1 ID2...
followed by one line per gene. Null genes are represented with the NA code. The GENE field is the gene-name; the NV field is the number of variants (post any filters) in this gene. Each individual is scored for the number of minor alleles that individual carries for that gene/group. If --collapse is added, then scores are reduced to 0s and 1s, indicating the presence of at least one minor allele. If --hide-invariant is included, then genes that are invariant (e.g. either nobody carries a one minor allele) are not reported. As for most commands, any mask options can be included too.
Output variant meta-information as a rectangular matrix
To create a matrix of per-variant meta-information, use the command:
pseq /path/to/my/project meta-matrix
This will generate a variant-by-meta-field matrix, e.g.
CHR POS ID AB DP CODING 1 10001 rs0001 NA 6652 Missense 1 20001 . 0.56 1124 Silent ...
The behavior of this command can be modified with the options
--show AB DP only show fields AB and DP --hide AB DP show all but fields AB and DP
If the project contains more than one file, then the variant meta-information specific to each file is shown with the prefix F1_, F2_, etc. Any mask can be applied, as for most commands.
Output genotype meta-information as a rectangular matrix
For a single, genotypic meta-field (e.g. per-individual read-depth, encoded DP), the command
pseq /path/to/my/project v-meta-matrix --name DP
This will dump a variant-by-individual matrix of read-depth information. Missing values are encoded as NA; the file is tab-delimited e.g. for individuals with IDs P1 to P4:
MARKER P1 P2 P3 P4 chr1:1000 22 65 10 NA chr1:1010 8 45 15 2 chr2:999 16 30 8 NA
Any mask can be applied, as for most commands.
Output genotype likelihoods in BEAGLE format
To write a file of genotype likelihoods (that is correctly formatted for BEAGLE), use the command:
pseq /path/to/my/project write-lik
which writes to standard output. This command assumes that each genotype has a meta-information field named GL (genotype likelihood), which is a vector of 3 floating-point values, or PL tag, which is the same but Phred-scaled integers. The genotype likelihoods are scaled to sum to 1.0. The format of the header row is:
VAR REF ALT ID1 ID1 ID1 ID2 ID2 ID2...
where ID1 and ID2 are the IDs for the first and second persons in the dataset, followed by one line per variant in the above format (i.e. each genotype is represented by 3 values). This command can be combined with a mask via the --mask option.