Calculating A Similarity/difference Matrix From Equal Length Strings In Python
I have pairs of equal-length strings in Python, and an accepted alphabet. Not all of the letters in the strings will come from the accepted alphabet. E.g. str1 = 'ACGT-N?A' str2 =
Solution 1:
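Use the dot product of two boolean (one-hot) matrices, which is the easiest way to keep the order right. In the result, rows follow the letters of the second string and columns follow the letters of the first: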
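import numpy as np

def simMtx(a, x, y):
    a = np.array(list(a))
    x = np.array(list(x))
    y = np.array(list(y))
    # One-hot encode each string against the alphabet: shape (len(string), len(alphabet)).
    # Characters outside the alphabet become all-zero rows.
    ax = (x[:, None] == a[None, :]).astype(int)
    ay = (y[:, None] == a[None, :]).astype(int)
    # Entry [i, j] counts positions where y has a[i] and x has a[j].
    return np.dot(ay.T, ax)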
simMtx(alphabet, str1, str2)
Out[183]: 
array([[1, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]])
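As a sanity check on what the dot product computes, the same counts can be produced in plain Python by tallying aligned character pairs. This is only a minimal sketch, not part of the original answer: the name `pair_counts` is made up for illustration, and it assumes the same orientation as `simMtx` (rows follow the second string, columns follow the first).

```python
from collections import Counter

def pair_counts(alphabet, s1, s2):
    """Count aligned (s2 char, s1 char) pairs restricted to the alphabet.

    Hypothetical helper for illustration; rows follow s2, columns follow s1.
    """
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # Tally every aligned pair once; pairs containing a character outside
    # the alphabet are simply never looked up below.
    pairs = Counter(zip(s2, s1))
    return [[pairs[(r, c)] for c in alphabet] for r in alphabet]

result = pair_counts('ACGT', 'ACGT-N?A', 'AAGAA??T')
for row in result:
    print(row)
```

For the example strings this yields the same four rows as the NumPy version, which confirms that each cell is just a count of aligned letter pairs.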
Solution 2:
This can be done succinctly with sets, a few comprehensions, and a pandas.DataFrame:
Code:
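import re
import pandas as pd

def dot_product(allowed, s1, s2):
    # For each allowed character, collect the set of positions where it occurs.
    # re.escape guards against characters that are special in regular expressions.
    in_s1 = {c: {m.start() for m in re.finditer(re.escape(c), s1)}
             for c in allowed}
    in_s2 = {c: {m.start() for m in re.finditer(re.escape(c), s2)}
             for c in allowed}
    # Rows follow s2, columns follow s1; each cell is the number of shared positions.
    return pd.DataFrame(
        [[len(in_s1[c1] & in_s2[c2]) for c1 in allowed] for c2 in allowed],
        columns=list(allowed),
        index=list(allowed),
    )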
Test Code:
str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'
print(dot_product(alphabet, str1, str2))
Results:
   A  C  G  T
A  1  1  0  1
C  0  0  0  0
G  0  0  1  0
T  1  0  0  0
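One way to read this matrix, sketched here on the assumption that the goal is to count matching versus differing positions: the diagonal holds positions where both strings carry the same alphabet letter, the off-diagonal cells hold positions where they carry different alphabet letters, and the remaining positions were skipped because at least one character is outside the alphabet. The helper name `summarize` is hypothetical, and the sketch assumes the alphabet contains no repeated letters.

```python
def summarize(matrix, length):
    """Split aligned positions into matches, mismatches and skipped.

    Hypothetical helper; `matrix` is the square count matrix as nested
    lists and `length` is the common length of the two strings.
    """
    total = sum(sum(row) for row in matrix)
    matches = sum(matrix[i][i] for i in range(len(matrix)))
    return {
        'matches': matches,
        'mismatches': total - matches,
        'skipped': length - total,
    }

# The count matrix from the results above, written out as nested lists.
counts = [[1, 1, 0, 1],
          [0, 0, 0, 0],
          [0, 0, 1, 0],
          [1, 0, 0, 0]]
summary = summarize(counts, len('ACGT-N?A'))
print(summary)
```

With a NumPy array or DataFrame, the same numbers could be taken from the trace and the overall sum instead of the explicit loops.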