---
title: "R Notebook for codon optimized sequences from proteins"
output:
  html_notebook: default
  html_document: 
    keep_md: yes
  pdf_document: default
---

Steps taken to generate a potential codon list for a given protein sequence:

First we need a codon table with the frequencies that we prefer. The corresponding data frame must have the following columns: "AA", "codon", and "relfreq_adj", where `relfreq_adj` is a number from 0 to 1 that gives the relative frequency of each codon among the codons selected for its amino acid. If only one codon is kept for an amino acid, its frequency is 1.

To do this, we read the table from a tab-delimited text file (in the same folder as the script):
```{r}
codons_table_selection <- read.delim(file = "codon_table_cryptococcus_selection20230305.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
codons_table_selection <- codons_table_selection[order(codons_table_selection$AA), ]
print(codons_table_selection)
```
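
Before using the table, it may be worth checking that it has the layout described above. The following is a minimal sketch, not part of the original script: `check_codon_table` and `toy_table` are hypothetical names, the toy values are made up, and the check assumes that the relative frequencies of the codons kept for one amino acid should sum to 1.

```r
# Hypothetical helper, not part of the original script: checks the layout
# described above. Assumes relfreq_adj sums to 1 within each amino acid.
check_codon_table <- function(codontable, tol = 1e-6) {
  required <- c("AA", "codon", "relfreq_adj")
  missing_cols <- setdiff(required, colnames(codontable))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (any(codontable$relfreq_adj < 0 | codontable$relfreq_adj > 1)) {
    stop("relfreq_adj must lie between 0 and 1")
  }
  # sum of the relative frequencies for each amino acid
  sums <- tapply(codontable$relfreq_adj, codontable$AA, sum)
  bad <- names(sums)[abs(sums - 1) > tol]
  if (length(bad) > 0) {
    stop("relfreq_adj does not sum to 1 for: ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up toy table, for illustration only
toy_table <- data.frame(
  AA = c("A", "A", "Q"),
  codon = c("GCC", "GCT", "CAG"),
  relfreq_adj = c(0.6, 0.4, 1),
  stringsAsFactors = FALSE
)
check_codon_table(toy_table)  # passes silently
```

Note that `sample()` rescales its `prob` argument, so a table whose frequencies do not sum to 1 would still produce codons; the check only makes such a table visible.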


The sequence of the protein for which we would like to obtain a coding sequence is provided as a single long character string, for example "AAAGGTLLLLLSTTVW":

```{r}
myprot_seq_raw <- "MAAAGGCTLLLAAAASTTTW
              QAQQQQQTTTTWWWLLLLL"
# if the input is in a FASTA-like format, we need to remove the end of line characters and spaces
myprot_seq_clean <- gsub("[\r\n ]", "", myprot_seq_raw)
```
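
Since the codon table is indexed by one-letter amino acid codes, a quick look for unexpected characters in the cleaned sequence can save some confusion later. This is a sketch under an assumption: only the 20 standard amino acids are expected, and `standard_aa` and `find_unexpected_residues` are hypothetical names that do not appear in the original script.

```r
# Hypothetical check, not part of the original script.
# Assumes that only the one-letter codes of the 20 standard amino acids are valid.
standard_aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", split = "")[[1]]

find_unexpected_residues <- function(proteinseq) {
  residues <- strsplit(proteinseq, split = "")[[1]]
  # keep each unexpected character once, in order of first appearance
  unique(residues[!residues %in% standard_aa])
}

find_unexpected_residues("MAAAGGCTLLL")   # character(0)
find_unexpected_residues("MAAXGG*B")      # "X" "*" "B"
```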


An accessory function, `generate_list_codons`, creates a vector of codons for a given amino acid, with exactly as many codons as there are occurrences of that amino acid in the protein sequence. For example, for the sequence AAAAGGG, we obtain 4 codons for "Ala" and 3 codons for "Gly". The codons are sampled randomly, with replacement, from the provided table of selected codons, weighted by their relative frequencies.

```{r}
generate_list_codons <- function(aminoacid, codontable, aanumbers){
  # Returns a character vector of codons for one amino acid, with as many
  # codons as there are occurrences of that amino acid in the protein.
  # Codons are sampled with replacement, weighted by the frequencies in codontable.
  # - aminoacid: a single character (one-letter amino acid code)
  # - codontable: a data frame with the columns "AA", "codon", "relfreq_adj";
  #   "relfreq_adj" is the relative frequency of each codon selected as "OK"
  # - aanumbers: a data frame with the columns "protein_aa" and "Freq", giving
  #   the number of occurrences of each amino acid in the protein sequence
  aarows <- which(codontable$AA == aminoacid)
  aanumberrows <- which(aanumbers$protein_aa == aminoacid)
  aacodons <- codontable[aarows, "codon"]
  aafreqs <- codontable[aarows, "relfreq_adj"]

  return(sample(aacodons, prob = aafreqs,
                replace = TRUE, size = aanumbers$Freq[aanumberrows]))
}

```
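
The random choice relies on the `prob` argument of `sample()`. As an illustration only, with made-up codons and weights (`toy_codons` and `toy_freqs` are not taken from the real table), the sketch below shows that the observed proportions approach the requested frequencies and that a single candidate codon is always returned as is.

```r
# Made-up codons and weights, for illustration only
toy_codons <- c("GCC", "GCT")
toy_freqs <- c(0.6, 0.4)

set.seed(1)  # arbitrary seed, only to make the illustration repeatable
draws <- sample(toy_codons, size = 10000, replace = TRUE, prob = toy_freqs)
# the observed proportions should be close to toy_freqs
round(prop.table(table(draws)), 2)

# With a single candidate codon, every draw returns that codon
single <- sample("CAG", size = 6, replace = TRUE, prob = 1)
single
```

For short proteins the number of draws per amino acid is small, so the codon usage of a single generated sequence can deviate noticeably from the table.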

To test the function, we first need to transform the protein sequence into a table of amino acid counts:

```{r}
protein_aa <- unlist(strsplit(myprot_seq_clean, split=""))
aanumbers <- as.data.frame(table(protein_aa), stringsAsFactors = FALSE)
print(aanumbers)
# test the function:
generate_list_codons("A", codons_table_selection, aanumbers)
generate_list_codons("Q", codons_table_selection, aanumbers)
#for "Q" there is only one codon in the table, so all the
#codons in the list are identical: "CAG".
```
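
The counting step can be illustrated on the AAAAGGG example, as a small sketch with a hypothetical variable name (`toy_aa`). It also shows a detail that is easy to miss: the first column of the resulting data frame is named after the variable passed to `table()`, and `generate_list_codons` looks for a column called `protein_aa`.

```r
# Toy illustration of the counting step on the AAAAGGG example
toy_aa <- unlist(strsplit("AAAAGGG", split = ""))
toy_counts <- as.data.frame(table(toy_aa), stringsAsFactors = FALSE)
toy_counts
# The first column is called "toy_aa" here, not "protein_aa",
# so it would have to be renamed before use with generate_list_codons:
colnames(toy_counts)[1] <- "protein_aa"
```
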

The second function, `protein_to_codons`, takes a "clean" protein sequence and the codon frequency table, and uses the first function to create a set of codons for each amino acid. It then assigns these codons to the residues in their original order and returns a data frame with one row per residue, containing the amino acid and its proposed codon.

```{r}

protein_to_codons <- function(proteinseq, codons_table_in_shape){
  # Generates a data frame that contains the initial protein sequence
  # and a proposed succession of codons, using "generate_list_codons".
  # - proteinseq: the protein sequence as a single character string
  # - codons_table_in_shape: a table of codon frequencies in the shape
  #   expected by "generate_list_codons"
  set.seed(550) # fixed seed: identical protein sequences give identical codons
  protein_aa <- unlist(strsplit(proteinseq, split = ""))
  aanumbers <- as.data.frame(table(protein_aa), stringsAsFactors = FALSE)
  # aanumbers has two columns, "protein_aa" and "Freq",
  # and is already sorted alphabetically
  aacodons <- lapply(aanumbers$protein_aa,
                     FUN = generate_list_codons,
                     codons_table_in_shape,
                     aanumbers)
  # aacodons is a list with one character vector of codons per amino acid
  aa_alphabetical <- protein_aa[order(protein_aa)]
  aa_long <- data.frame(aa = aa_alphabetical, codons = unlist(aacodons))
  # aa_long has one row per residue: "aa" holds the amino acids in
  # alphabetical order and "codons" the corresponding codons
  protein_aa.df <- data.frame(protein_aa = protein_aa, stringsAsFactors = FALSE)
  # protein_aa.df contains the amino acids in their order in the initial sequence
  protein_aa.df$idx <- seq_along(protein_aa)
  # idx records the original position, to restore the order at the end
  protein_aa.df <- protein_aa.df[order(protein_aa.df$protein_aa), ]
  # once sorted alphabetically, its rows line up with those of aa_long
  aatocodons.df <- cbind(protein_aa.df, aa_long)
  # reorder the final data frame according to the original order of the amino acids
  return(aatocodons.df[order(aatocodons.df$idx), ])
}

```
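
The bookkeeping in this function (sort the residues alphabetically, attach values that were generated in alphabetical order, then restore the original order through a stored index) can be followed on a toy example. The residues and the placeholder labels below are made up and only stand in for sampled codons.

```r
# Toy illustration of the sort / attach / restore-order steps
residues <- c("M", "A", "G", "A")
sorted_idx <- order(residues)   # original positions in alphabetical order: 2 4 3 1
# placeholder labels standing in for codons, grouped alphabetically by residue
labels_sorted <- c("A_1", "A_2", "G_1", "M_1")

toy <- data.frame(
  aa = residues[sorted_idx],
  idx = sorted_idx,
  label = labels_sorted,
  stringsAsFactors = FALSE
)
# sorting by the stored index restores the original residue order
toy <- toy[order(toy$idx), ]
toy$aa      # "M" "A" "G" "A"
toy$label   # "M_1" "A_1" "G_1" "A_2"
```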

Now we can test the function on the short sequence `myprot_seq_clean` and check that the results look about right. It would be much better to test this output automatically and to add error checking and exception handling. It would also be good to modify the code so as to avoid repeats and low-complexity regions. Something for version 2...

```{r}

test <- protein_to_codons(myprot_seq_clean, as.data.frame(codons_table_selection))
print(test[, c("aa", "codons")])
```
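
One possible automatic check is to translate the proposed codons back into a protein and compare the result with the input. The sketch below is not part of the original script: `genetic_code` and `translate_codons` are hypothetical names, and it assumes that the standard genetic code applies to the target organism.

```r
# Sketch of an automated check, not part of the original script.
# Assumption: the standard genetic code applies to the target organism.
bases <- c("T", "C", "A", "G")
# expand.grid varies its first column fastest, so the codons come out in the
# usual TTT, TTC, TTA, TTG, TCT, ... order
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
genetic_code <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
  split = "")[[1]]
names(genetic_code) <- paste0(grid$first, grid$second, grid$third)

translate_codons <- function(codons) {
  # codons that are not in the lookup table show up as "NA" in the result
  paste(unname(genetic_code[toupper(codons)]), collapse = "")
}

# arbitrary example codons
translate_codons(c("ATG", "GCC", "GCT", "CAG", "TGG"))  # "MAAQW"
```

With the objects defined in this notebook, `identical(translate_codons(test$codons), myprot_seq_clean)` would then be expected to return `TRUE`.
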

To manually "adjust" codons, one can save the obtained data frame in a tabular format. Otherwise, to obtain just the sequence (and check that it is correct by translating it in ApE):

```{r}
# print(paste(test$codons, collapse = ""))
# The line above prints one long string without line breaks, which is hard to read.
# Instead, split the sequence into chunks of 100 characters:
toprint <- paste(test$codons, collapse="")
splitted <- strsplit(gsub("([[:alnum:]]{100})", "\\1 ", toprint), split=" ", fixed=T)
print(unlist(splitted), quote=F)
```
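
An alternative way to wrap the sequence, sketched here with a hypothetical helper (`wrap_sequence` is not part of the original script), uses `substring()` with explicit start and end positions instead of a regular expression. It does not depend on the sequence containing only alphanumeric characters.

```r
# Alternative sketch for wrapping a long sequence into fixed-width chunks
wrap_sequence <- function(sequence, width = 100) {
  n <- nchar(sequence)
  if (n == 0) {
    return(character(0))
  }
  starts <- seq(1, n, by = width)
  # the last chunk may be shorter than width
  substring(sequence, starts, pmin(starts + width - 1, n))
}

wrap_sequence("ATGGCCGCCGCT", width = 5)   # "ATGGC" "CGCCG" "CT"
```

The chunks could then be written to a file with `writeLines()`, one chunk per line.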

