How to create listgenes_core.txt file? #8

Open
KTbiotech opened this issue Apr 13, 2021 · 3 comments

@KTbiotech

Is there a way to create a file analysis/listgenes_core.txt? I don't see any instructions mentioning it.

@pedrorvc
Contributor

pedrorvc commented Apr 13, 2021

Hello @KTbiotech,

The analysis/listgenes_core.txt file is created by copying the loci list for one threshold from the analysis/Genes_95%.txt file, which is generated by TestGenomeQuality.
Genes_95%.txt contains, for each threshold, the list of loci that are present in 95% of the strains.
In the tutorial, we chose the loci list from a threshold in the 60 to 195 range, since the number of loci is stable there, as verified by the plot created with TestGenomeQuality.
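
If you prefer to check the stability numerically rather than from the plot, the number of loci per threshold can be counted directly from Genes_95%.txt. The snippet below is only a sketch: the function name `loci_per_threshold` is made up for illustration, and it assumes the file is tab-delimited with a `Threshold` and a `Present_genes` column, integer thresholds, and whitespace-separated file names.

```python
import csv


def loci_per_threshold(genes_95_path):
    """Return {threshold: number of loci} for a Genes_95%.txt-style file.

    Assumes a tab-delimited file with a header row containing the
    columns "Threshold" and "Present_genes" (hypothetical helper).
    """
    counts = {}
    with open(genes_95_path, "r", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Present_genes holds whitespace-separated file names;
            # an empty or missing cell counts as zero loci.
            genes = (row["Present_genes"] or "").split()
            counts[int(row["Threshold"])] = len(genes)
    return counts


# Example usage (the path is a placeholder):
# for threshold, n_loci in sorted(loci_per_threshold("path/to/Genes_95%.txt").items()):
#     print(threshold, n_loci)
```

A run of thresholds with the same count would correspond to the flat part of the plot.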

Let us know if this answers your question.

Cheers,
Pedro

@bgka2009

Dear Pedro,

I have the same question as KTbiotech.
How can I copy the Genes_95%.txt file to the listgenes_core.txt file based on the threshold?
For this analysis, do I need a further script, or do I just copy the file and change its name?

If possible, could you give some details on this procedure?
Thank you in advance.

@pedrorvc
Contributor

pedrorvc commented Apr 28, 2021

Dear @KTbiotech and @bgka2009,

The Genes_95%.txt file contains two columns: Threshold and Present_genes:

Threshold	Present_genes
0	GCA-000007265-protein1.fasta GCA-000007265-protein10.fasta GCA-000007265-protein101.fasta
5	GCA-000012705-protein4139.fasta GCA-000012705-protein4161.fasta GCA-000196055-protein1567.fasta
...
195	 GCA-001592385-protein969.fasta GCA-001592385-protein970.fasta GCA-001592385-protein971.fasta

You need to copy the list of files (the Present_genes column) for the chosen threshold, paste it into a new file with one file name per line, and name that file listgenes_core.txt.

If you simply copy Genes_95%.txt and change the filename, it will not work.

Below is a small Python3 snippet to create a file with the genes list from the desired threshold.

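import csv

genes_95 = "path/to/Genes_95%.txt"
output_file = "path/to/output_dir/listgenes_core.txt"
chosen_threshold = "[CHOSEN THRESHOLD]"  # e.g. "60"; compared as a string

list_genes = None
with open(genes_95, "r", newline="") as f:
    # The file is tab-delimited with a header row: Threshold, Present_genes
    for row in csv.DictReader(f, delimiter="\t"):
        if row["Threshold"] == chosen_threshold:
            # split() also drops any leading or trailing whitespace
            list_genes = row["Present_genes"].split()
            break

if list_genes is None:
    raise ValueError(f"Threshold {chosen_threshold} not found in {genes_95}")

# Write one file name per line
with open(output_file, "w") as out:
    out.write("\n".join(list_genes) + "\n")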

This snippet is just a suggestion; you may use whatever procedure you are most comfortable with.
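
Whichever procedure you use, it may be worth checking that the resulting file really has one file name per line. The following is a rough sketch, not part of the tutorial: the function name `check_gene_list` is hypothetical, and the checks (no empty lines, no whitespace inside a line, no duplicates) are assumptions about what a well-formed listgenes_core.txt looks like.

```python
def check_gene_list(path):
    """Return (number of lines, list of problems) for a one-name-per-line file.

    Hypothetical helper: the checks below are assumptions, not rules
    taken from the tool's documentation.
    """
    problems = []
    seen = set()
    with open(path, "r") as f:
        entries = [line.rstrip("\n") for line in f]
    for number, entry in enumerate(entries, start=1):
        if not entry.strip():
            problems.append(f"line {number}: empty line")
        elif len(entry.split()) > 1:
            problems.append(f"line {number}: more than one name on the line")
        elif entry in seen:
            problems.append(f"line {number}: duplicate entry {entry}")
        seen.add(entry)
    return len(entries), problems


# Example usage (the path is a placeholder):
# n_lines, problems = check_gene_list("path/to/output_dir/listgenes_core.txt")
# print(n_lines, "lines;", len(problems), "problems")
```

An empty problem list only means the file has the expected shape; it does not confirm that the right threshold was chosen.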

We will work on revising the tutorial instructions to make this and other points clearer.

Please let us know if you were able to solve the issue.
Pedro
