Welcome to the documentation!

Introduction

dask-ngs is a Python library for scalable bioinformatics data analysis.

It is built on top of Dask and Oxbow.

dask_ngs.read_bam(path: str | Path, chunksize: int = 10000000, index: str | Path | None = None) → DataFrame

Map an indexed BAM file to a Dask DataFrame.

Parameters:

path (str or Path) – Path to the BAM file.
chunksize (int, optional [default=10_000_000]) – Approximate partition size, in compressed bytes.
index (str or Path, optional) – Path to the index file. If omitted, the index is assumed to sit next to the BAM file, under the same name with an added .bai or .csi extension.

Return type:

dask.dataframe.DataFrame
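
The default index lookup described for the index parameter can be sketched as follows. This is an illustration only: bam_index_candidates and open_bam are hypothetical helpers, not part of dask-ngs, and the sketch assumes that "additional extension" means the suffix is appended to the full file name (sample.bam becomes sample.bam.bai). Only the read_bam call uses the signature documented above.

```python
from pathlib import Path


def bam_index_candidates(path):
    """Return the index paths implied by the documented default.

    Assumption: the extension is appended to the full BAM file name,
    so "sample.bam" maps to "sample.bam.bai" and "sample.bam.csi".
    """
    path = Path(path)
    return [Path(f"{path}.bai"), Path(f"{path}.csi")]


def open_bam(path, chunksize=10_000_000):
    """Hypothetical wrapper: pick the first index that exists, then read."""
    # Imported lazily so the helper above works without dask-ngs installed.
    import dask_ngs

    index = next((p for p in bam_index_candidates(path) if p.exists()), None)
    # Passing index=None leaves the lookup to the library's own default.
    return dask_ngs.read_bam(path, chunksize=chunksize, index=index)
```

Passing an explicit index is mainly useful when the index file is stored somewhere other than next to the BAM file.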

dask_ngs.read_bcf(path: str | Path, chunksize: int = 10000000, index: str | Path | None = None) → DataFrame

Map an indexed BCF file to a Dask DataFrame.

Parameters:

path (str or Path) – Path to the BCF file.
chunksize (int, optional [default=10_000_000]) – Approximate partition size, in compressed bytes.
index (str or Path, optional) – Path to the index file. If omitted, the index is assumed to sit next to the BCF file, under the same name with an added .csi extension.

Return type:

dask.dataframe.DataFrame
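
Because chunksize is an approximate partition size in compressed bytes, a rough partition count can be estimated from the size of the file on disk. The helper below is a back-of-the-envelope sketch, not library code: estimated_partitions is a hypothetical name, and the real number of partitions may differ depending on how the reader aligns partitions to the file's internal blocks.

```python
import math


def estimated_partitions(compressed_size_bytes, chunksize=10_000_000):
    """Estimate how many partitions a file of the given compressed size yields.

    Assumption: partitions are cut roughly every `chunksize` compressed bytes,
    and even a very small file produces at least one partition.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive number of bytes")
    return max(1, math.ceil(compressed_size_bytes / chunksize))


# A 25 MB file with the default chunksize gives about three partitions.
print(estimated_partitions(25_000_000))
```

A larger chunksize gives fewer, bigger partitions; a smaller one gives more, smaller partitions and therefore more tasks in the Dask graph.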

dask_ngs.read_vcf(path: str | Path, chunksize: int = 10000000, index: str | Path | None = None) → DataFrame

Map an indexed, BGZF-compressed VCF (.vcf.gz) file to a Dask DataFrame.

Parameters:

path (str or Path) – Path to the VCF.gz file.
chunksize (int, optional [default=10_000_000]) – Approximate partition size, in compressed bytes.
index (str or Path, optional) – Path to the index file. If omitted, the index is assumed to sit next to the VCF.gz file, under the same name with an added .tbi or .csi extension.

Return type:

dask.dataframe.DataFrame
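
Since the three readers share one signature, a caller can choose between them by file extension. The sketch below shows one way to do that; READERS, reader_name_for, and read_any are hypothetical names introduced for illustration, and the suffix mapping (.bam, .bcf, .vcf.gz) is an assumption about how input files are named rather than something dask-ngs enforces.

```python
from pathlib import Path

# File suffix -> name of the dask_ngs reader documented on this page.
READERS = {
    ".bam": "read_bam",
    ".bcf": "read_bcf",
    ".vcf.gz": "read_vcf",
}


def reader_name_for(path):
    """Return the name of the dask_ngs reader matching the file's suffix."""
    name = Path(path).name.lower()
    for suffix, reader in READERS.items():
        if name.endswith(suffix):
            return reader
    raise ValueError(f"no dask_ngs reader for {str(path)!r}")


def read_any(path, **kwargs):
    """Hypothetical dispatcher: forward chunksize/index to the matching reader."""
    # Imported lazily so reader_name_for works without dask-ngs installed.
    import dask_ngs

    reader = getattr(dask_ngs, reader_name_for(path))
    return reader(path, **kwargs)
```

Each reader returns a dask.dataframe.DataFrame, so downstream code can treat the result the same way regardless of the input format.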