Calculating A Similarity/difference Matrix From Equal Length Strings In Python
I have pairs of equal-length strings in Python, and an accepted alphabet. Not all of the letters in the strings will come from the accepted alphabet. E.g. str1 = 'ACGT-N?A' str2 =
Solution 1:
Use the dot product of two boolean indicator matrices (the easiest way to keep the row and column order right):
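import numpy as np

def simMtx(a, x, y):
    # Split the alphabet and both strings into arrays of single characters
    a = np.array(list(a))
    x = np.array(list(x))
    y = np.array(list(y))
    # Indicator matrices (positions x alphabet); characters outside the
    # alphabet produce all-zero rows and so never contribute to a count
    ax = (x[:, None] == a[None, :]).astype(int)
    ay = (y[:, None] == a[None, :]).astype(int)
    # Entry [i, j] counts positions where y has a[i] and x has a[j]
    return np.dot(ay.T, ax)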
simMtx(alphabet, str1, str2)
Out[183]:
array([[1, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]])
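As a cross-check on the row/column orientation, the same counts can be reproduced without NumPy by walking the two strings position by position. This is only a minimal sketch: the name pair_counts is hypothetical and not part of the answer above, and the input strings are the ones from the test code on this page. Rows follow the second string and columns the first, matching np.dot(ay.T, ax).

```python
from collections import Counter

def pair_counts(alphabet, x, y):
    # Hypothetical helper: tally (y_char, x_char) pairs at aligned positions
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    counts = Counter(zip(y, x))
    # Rows follow y, columns follow x; pairs involving characters outside
    # the alphabet are simply never looked up
    return [[counts[(r, c)] for c in alphabet] for r in alphabet]

result = pair_counts('ACGT', 'ACGT-N?A', 'AAGAA??T')
for row in result:
    print(row)
```

For long strings the NumPy version is likely to be faster; the loop-free Counter version is mainly useful for verifying which axis belongs to which string.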
Solution 2:
This task can be done succinctly using set, a few comprehensions, and a pandas.DataFrame:
Code:
import re
import pandas as pd

def dot_product(allowed, s1, s2):
    # For each allowed character, collect the set of positions where it occurs
    # (re.escape guards against characters that are regex metacharacters)
    in_s1 = {c: {m.start() for m in re.finditer(re.escape(c), s1)}
             for c in allowed}
    in_s2 = {c: {m.start() for m in re.finditer(re.escape(c), s2)}
             for c in allowed}
    # Rows follow s2, columns follow s1; each cell counts shared positions
    return pd.DataFrame(
        [[len(in_s1[c1] & in_s2[c2]) for c1 in allowed] for c2 in allowed],
        columns=list(allowed),
        index=list(allowed),
    )
Test Code:
str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'
print(dot_product(alphabet, str1, str2))
Results:
   A  C  G  T
A  1  1  0  1
C  0  0  0  0
G  0  0  1  0
T  1  0  0  0
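If a single similarity or difference figure is wanted from this matrix, one plausible reading (an assumption, since the question text here does not spell it out) is that the diagonal counts positions where both strings hold the same allowed letter, while the off-diagonal cells count positions where they hold different allowed letters. The sketch below works on a plain nested list; the helper name summarize is hypothetical.

```python
def summarize(matrix):
    # Hypothetical helper: split a square count matrix into
    # matching (diagonal) and differing (off-diagonal) totals
    matches = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return matches, total - matches

# Counts copied from the results above (rows: str2, columns: str1)
matrix = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
matches, mismatches = summarize(matrix)
print(matches, mismatches)  # 2 3
```

Positions where either string holds a character outside the alphabet (such as '-', 'N' or '?') are not in the matrix at all, so they count toward neither total.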