Infinite finitely generated groups fp and free groups Fr
TATA box
We start with a simple example of an infinite finitely generated group taken from the context of noncoding DNA. The Goldberg–Hogness sequence, also known as the TATA box, is a DNA sequence located in the core promoter region of many eukaryotic genes. It is a noncoding sequence of repeated T and A base pairs. The TATA box serves as the binding site for the TATA-binding protein and other transcription factors in some eukaryotic genes. Its consensus sequence takes the form rel = TATAAAA. Variations in this consensus sequence, resulting from genetic polymorphism, can lead to diseases like Gilbert’s syndrome and immune suppression (https://en.wikipedia.org/wiki/TATA_box).
In our methodology, we define the group fp = 〈A, T|rel〉, which contains an infinite number of elements. There are numerous ways to investigate this group, but we opt for a specific one: calculating the number of conjugacy classes of subgroups of index d of fp for d = 1, 2, 3, ··· (a sequence we refer to as the card seq of fp). The card seq of fp for the selected TATA sequence is [1,1,2,3,2,8,7,10,18,28,···]. Interestingly, the group H3 = 〈A, T|A² = T³〉 has the same card seq (at least up to the highest index we can reach with the calculations). The group H3, as defined, is a central extension of the so-called modular group PSL(2,Z), the projective special linear group of (2 × 2) matrices of determinant 1 with integer entries, which is recovered from H3 by adding the relation A² = 1. The group H3 has an intriguing topological interpretation as the fundamental group of the trefoil knot complement. Thus, we find that the group fp is close to H3 in the sense that both groups have the same card seq, but we can easily verify that fp and H3 are not isomorphic. According to Planat et al.,23 the Hecke groups Hq = 〈A, T|A² = T^q〉, with q = 3 or 4, have a card seq corresponding to healthy TATA box sequences. A TATA box whose fp group has the card seq of a group Hq with q other than 3 or 4, or even that of a group slightly different from H3 and H4, signifies Gilbert’s syndrome.
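For small indices, the card seq can be reproduced by brute force. The sketch below is only an illustration and not necessarily the software used for the values quoted above. It relies on the standard correspondence between conjugacy classes of index-d subgroups and transitive actions on d points, taken up to simultaneous conjugation in the symmetric group. The function names (compose, is_transitive, card_seq) are our own, and the enumeration is only practical for small d.

```python
from itertools import permutations, product


def compose(p, q):
    """Return the permutation 'apply p, then q' (both given as tuples)."""
    return tuple(q[i] for i in p)


def is_transitive(gens, d):
    """Check that the permutations in gens act transitively on {0, ..., d-1}."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for g in gens:
            if g[i] not in seen:
                seen.add(g[i])
                stack.append(g[i])
    return len(seen) == d


def card_seq(letters, relator, max_index):
    """Count conjugacy classes of index-d subgroups of <letters | relator>.

    Brute force: each class corresponds to a tuple of permutations of d
    points that satisfies the relator and acts transitively, taken up to
    simultaneous conjugation in the symmetric group S_d.
    """
    counts = []
    for d in range(1, max_index + 1):
        sym = list(permutations(range(d)))
        identity = tuple(range(d))
        canonical = set()
        for images in product(sym, repeat=len(letters)):
            rho = dict(zip(letters, images))
            word = identity
            for letter in relator:
                word = compose(word, rho[letter])
            if word != identity or not is_transitive(images, d):
                continue
            # Canonical representative: the smallest simultaneous conjugate.
            best = None
            for s in sym:
                s_inv = tuple(sorted(range(d), key=s.__getitem__))
                conj = tuple(compose(compose(s_inv, g), s) for g in images)
                if best is None or conj < best:
                    best = conj
            canonical.add(best)
        counts.append(len(canonical))
    return counts


# First terms of the card seq for the TATA box relation.
print(card_seq("AT", "TATAAAA", 4))
```

The printed list can be compared with the first entries of the card seq given above. The cost grows like (d!)^k for k generators, so a dedicated low-index subgroup algorithm is needed to reach index 10.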
Polyadenylation signals
For our second example, we select a sequence from the context of eukaryotic polyadenylation (https://en.wikipedia.org/wiki/Polyadenylation). Polyadenylation is the addition of a poly(A) tail to an RNA transcript, usually an mRNA. A consensus poly(A) sequence takes the form rel1 = AAUAAA, which corresponds to a two-generator group of the form fp = 〈A, U|rel1〉. The card seq of such a group is found to be [1,1,1,1,1,1,1,1,1,1,···], implying a single conjugacy class of subgroups for each index. The free group F1 = 〈A, U|AU〉, of rank 1, has the same card seq as the fp group with relation rel1. This agreement is not accidental: U occurs exactly once in rel1, so the relation can be solved for U (U = A⁻⁵) and the fp group is itself infinite cyclic. Another consensus poly(A) sequence takes the form rel2 = UGUAA, which corresponds to a three-generator group of the form fp = 〈A, U, G|rel2〉. The card seq of such a group is found to be [1,3,7,26,97,624,4163,···]. The free group F2 = 〈A, U, G|AUG〉, of rank 2, has the same card seq as the fp group with relation rel2, again because G occurs exactly once in rel2 and can be eliminated. From our perspective, DNA/RNA sequences that lead to fp groups closely resembling a free group are considered healthy sequences.19,21,23 The standard poly(A) sequences mentioned above play a regulatory role in producing mature mRNA for translation. Sequences that generate an fp group diverging from a free group Fr may be indicative of a disease.
Aperiodic sequences, their attached groups fp and free groups
In this subsection, we explain how a group fp, whose card seq is found to be close to that of a free group Fr, can be linked to an aperiodic sequence and to the profinite completion F̂r. We introduced the concept of aperiodic groups and sequences in our earlier papers.21,23 Consider the motif rel = TTTATTA, which serves as a consensus sequence for the transcription factor of the DBX gene in Drosophila melanogaster (fruit fly). This gene is involved in neuronal specification and differentiation. The group fp = 〈A, T|rel〉 has the same card seq as the free group F1 of rank 1. Furthermore, by splitting rel into two segments, rel = relA relT, and applying the substitution maps A → relA = TTTA, T → relT = TTA, we generate the substitution sequence SDBX = A, T, AT, TTTATTA, TTATTATTATTTATTATTATTTA, ···. One can check that all the finitely generated groups fp(l), whose relations are AT, TTTATTA, TTATTATTATTTATTATTATTTA, ···, respectively, have the card seq of F1.
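The iterates of the substitution can be generated mechanically. The following minimal sketch applies the maps A → TTTA, T → TTA to the seed word AT. The function name substitution_sequence and the choice of seed are ours, for illustration only.

```python
def substitution_sequence(rules, seed, steps):
    """Iterate a substitution rule on a seed word and return every iterate."""
    words = [seed]
    for _ in range(steps):
        words.append("".join(rules[letter] for letter in words[-1]))
    return words


# Substitution read off from rel = TTTATTA, split as relA = TTTA and relT = TTA.
rules = {"A": "TTTA", "T": "TTA"}
for word in substitution_sequence(rules, "AT", 2):
    print(len(word), word)
# 2 AT
# 7 TTTATTA
# 23 TTATTATTATTTATTATTATTTA
```

Each iterate can then be used as the relation of a two-generator group fp(l).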
As per the findings of Planat et al.,23 for a substitution rule to be considered aperiodic it must satisfy two conditions: (1) The substitution matrix M must be primitive, meaning that M is a non-negative matrix and M^k is strictly positive (all entries > 0) for some k, a condition denoted as M^k ≫ 0; a primitive matrix is in particular irreducible. (2) The Perron–Frobenius eigenvalue λPF must be irrational. It is worth noting that the Perron–Frobenius eigenvector of an irreducible non-negative matrix is, up to scaling, the only eigenvector whose entries are all positive. The aforementioned sequence has a substitution matrix:
$$M = \begin{pmatrix} 1 & 3 \\ 1 & 2 \end{pmatrix}.$$
One can verify that M is primitive, as M² ≫ 0, and that λPF = (3 + √13)/2 ≈ 3.303. Conditions (1) and (2) are therefore satisfied, implying that the substitution SDBX is aperiodic. Of note, numerous other genes have transcription factors with a motif rel generating an aperiodic sequence.21
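Both conditions can be checked with a few lines of code. The sketch below builds the substitution matrix from the maps A → TTTA, T → TTA (row i lists the letter counts in the image of letter i), tests primitivity by inspecting successive powers, and computes the Perron–Frobenius eigenvalue in the 2 × 2 case. The function names are ours. The irrationality test assumes an integer 2 × 2 matrix with non-negative entries, for which the eigenvalue is irrational exactly when the discriminant is not a perfect square.

```python
import math


def substitution_matrix(rules, alphabet):
    """Entry (i, j): number of letters alphabet[j] in the image of alphabet[i]."""
    return [[rules[a].count(b) for b in alphabet] for a in alphabet]


def mat_mul(x, y):
    """Product of two square integer matrices given as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_primitive(m):
    """True if some power of the non-negative matrix m is strictly positive."""
    n = len(m)
    power = m
    # Wielandt's bound: a primitive n x n matrix is positive by power (n-1)^2 + 1.
    for _ in range((n - 1) ** 2 + 1):
        if all(entry > 0 for row in power for entry in row):
            return True
        power = mat_mul(power, m)
    return False


def perron_frobenius_2x2(m):
    """Largest eigenvalue of a non-negative integer 2 x 2 matrix, and whether it is irrational."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = trace * trace - 4 * det
    irrational = math.isqrt(disc) ** 2 != disc
    return (trace + math.sqrt(disc)) / 2, irrational


M = substitution_matrix({"A": "TTTA", "T": "TTA"}, "AT")
print(M, is_primitive(M), perron_frobenius_2x2(M))
```

For the DBX substitution, the discriminant is 13, which is not a perfect square, in agreement with λPF = (3 + √13)/2.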
Aperiodic sequences and the profinite groups F̂r
This section can be skipped without affecting the comprehension of the rest of the paper. It addresses the question of why the aforementioned groups fp(l) have the same card seq as the free group Fr. The tentative answer is that the profinite completion of every group fp(l) is the profinite group F̂1. With this observation, we relate the aperiodicity of sequences to profinite groups. Profinite groups were introduced by Grothendieck in the context of algebraic geometry.22 Here, we describe the necessary ingredients for the layperson, focusing first on F̂1 and then on F̂2, and their relevance to our present work.
A group G can be considered a topological group by endowing it with the discrete topology, in which every subset is open and the elements of G are points isolated from each other. A profinite group is a topological group that, in a certain sense, is assembled from a system of finite groups and group homomorphisms between them. Given a group G, there is a related profinite group Ĝ, the profinite completion of G, defined as the inverse limit Ĝ = lim← G/N of the groups G/N, where N runs through the normal subgroups of G of finite index. A normal subgroup is a subgroup that remains invariant under conjugation by members of the group. Each finite quotient group G/N corresponds to a normal subgroup N of G, and the profinite completion Ĝ can be perceived as containing an analog of each of these normal subgroups. The profinite group Ĝ exhibits several properties: it is compact, residually finite (meaning that for any nonidentity element g in Ĝ, there exists a finite quotient of Ĝ in which g is not the identity), and totally disconnected (meaning that the only nonempty connected subsets of Ĝ are singletons, sets containing only one element). In general, an explicit construction of profinite groups Ĝ cannot be obtained. However, F̂1 and F̂2 are not too complex to handle.
We begin with the profinite group F̂1. The free group F1 on a single generator can be described as a group with one generator, say a, and no relations. It consists of all possible finite strings that can be formed by combining the generator and its inverse. It is the infinite cyclic group Z = {1, a, a⁻¹, a², a⁻², a³, a⁻³, ···}. Now, we discuss the profinite completion of F1. The profinite group F̂1 is the group of profinite integers, the inverse limit of the finite cyclic groups Z/nZ. It is isomorphic to the direct product of the additive groups of p-adic integers Zp over all primes p. The p-adic integers are a special class of numbers used in number theory and algebraic geometry.
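To make the inverse limit concrete for G = Z, an element of the profinite completion can be pictured as a family of residues x_n in Z/nZ that agree under reduction: x_m reduces to x_n modulo n whenever n divides m. The sketch below checks this compatibility on finitely many levels only, so it illustrates the definition rather than constructing the completion. The function names are ours.

```python
def image_in_quotients(x, levels):
    """Image of the integer x in each finite quotient Z/nZ, for n in levels."""
    return {n: x % n for n in levels}


def is_compatible(family):
    """Check x_m = x_n (mod n) whenever n divides m, as the inverse limit requires."""
    return all(family[m] % n == family[n]
               for n in family for m in family if m % n == 0)


# Every ordinary integer gives a compatible family of residues.
family = image_in_quotients(-1, range(1, 13))
print(is_compatible(family))

# Changing a single residue breaks compatibility with the other levels.
family[4] = 0
print(is_compatible(family))
```

The first check succeeds and the second fails, because the residue 0 modulo 4 no longer reduces to the residue 1 modulo 2.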
We now briefly discuss the profinite group F̂2. This topic was first described by Grothendieck.22 The subject is complex and connected to the so-called Belyi theorem, a fundamental result that establishes a connection between algebraic curves defined over the algebraic closure Q̄ of the rationals and certain rational functions called Belyi functions. An algebraic curve defined over the complex numbers C can be represented as a branched covering of the Riemann sphere (the complex projective line P¹(C)) branched only over three points (usually taken as 0, 1, and ∞) if and only if the curve itself can be defined over a number field, which is a finite extension of the field of rational numbers Q.
In other words, the Belyi theorem implies that an algebraic curve defined over a number field can be mapped to the Riemann sphere in such a way that the ramification (branching) is restricted to just three points. The rational functions that provide these branched coverings are known as Belyi functions. The significance of the Belyi theorem lies in the fact that it provides a method to study algebraic curves defined over number fields by analyzing their ramified coverings and the associated ‘dessins d’enfants’, which are combinatorial objects encoding the ramification data. Specifically, we have the crucial result that:
$$\hat{\pi}_1\left(\mathbb{P}^1(\mathbb{C}) \setminus \{0, 1, \infty\}\right) \cong \hat{F}_2$$
i.e., the so-called étale fundamental group of the projective line with the three points 0, 1, and ∞ removed is the profinite group F̂2.
SL2(C) representations of groups fp and a Groebner basis G
While the previous section describing profinite groups showcases the importance of algebraic geometry in the context of DNA/RNA sequences, it remains somewhat abstract. To address this, we can consider the representations of an fp group over the space-time-spin group SL2(C), as we did in previous studies.18,19,21 Representations of fp in SL2(C) are homomorphisms ρ: fp → SL2(C) with character κρ(g) = tr(ρ(g)), g ∊ fp. The notation tr(ρ(g)) signifies the trace of the matrix ρ(g). The set of characters is used to determine an algebraic set by taking the quotient of the set of representations ρ by the group SL2(C), which acts by conjugation on representations.24,25 In those papers, we showed that the character variety of fp is defined by a set X of multivariate polynomials. A particular basis related to X is the Groebner basis G(X), whose factors define hypersurfaces.
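Two facts underlie the use of characters: the trace is unchanged when a representation is conjugated, and for two generators the trace of any word is a polynomial in x = tr ρ(A), y = tr ρ(T), and z = tr ρ(AT). The sketch below illustrates both with exact integer arithmetic. The matrices are a hypothetical choice made only for illustration. They define a representation of the free group on A and T, not necessarily one satisfying a given relation rel. Lowercase letters denote inverse generators, a convention of this sketch.

```python
def mul(a, b):
    """Product of two 2 x 2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def inv(a):
    """Inverse of a 2 x 2 matrix of determinant 1."""
    return ((a[1][1], -a[0][1]), (-a[1][0], a[0][0]))


def tr(a):
    """Trace of a 2 x 2 matrix."""
    return a[0][0] + a[1][1]


def character(rho, word):
    """Trace of rho(word); lowercase letters stand for inverse generators."""
    m = ((1, 0), (0, 1))
    for letter in word:
        g = rho[letter.upper()]
        m = mul(m, g if letter.isupper() else inv(g))
    return tr(m)


# Hypothetical integer matrices of determinant 1, chosen only for illustration.
rho = {"A": ((1, 1), (0, 1)), "T": ((1, 0), (1, 1))}
x, y, z = character(rho, "A"), character(rho, "T"), character(rho, "AT")

# Trace of the commutator A T A^-1 T^-1 against the classical Fricke identity.
print(character(rho, "ATat"), x * x + y * y + z * z - x * y * z - 2)

# Conjugating the representation leaves the character of any word unchanged.
p = ((2, 1), (1, 1))
rho_conj = {k: mul(mul(p, v), inv(p)) for k, v in rho.items()}
print(character(rho, "TATAAAA"), character(rho_conj, "TATAAAA"))
```

Each printed pair consists of two equal numbers.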
Our previous paper focused on a possible algebraic approach to topological quantum computing.18 In two subsequent papers,19,21 we investigated SL2(C) representations of short DNA/RNA sequences (e.g., the consensus sequence of a transcription factor or the seed of a miRNA) and related them to a potential disease. For a two-generator group fp, the factors are surfaces in three-dimensional space. In general, these surfaces can be classified by mapping them to a rational surface across five categories.19 Often encountered surfaces are degree p Del Pezzo surfaces, where 1 ≤ p ≤ 9. A rational surface may be nonsingular, almost nonsingular (having only isolated singularities), or singular. Almost nonsingular surfaces are key in our context. A simple singularity is referred to as an A-D-E singularity and must be of the type An, n ≥ 1, Dn, n ≥ 4, E6, E7, or E8. The A-D-E type is mirrored in the notation we employ. For instance, S(lA1,mA2,nA3,···) denotes a surface containing l singularities of type A1, m of type A2, n of type A3, etc. A typical surface is the Cayley cubic we encountered in our previous papers, defined as S(4A1) = xyz + x² + y² + z² − 4.19
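The notation S(4A1) can be checked directly on the Cayley cubic. A singular point is a point of the surface where all partial derivatives vanish, and a nonzero Hessian determinant there indicates an ordinary double point, that is, a singularity of type A1. The sketch below searches a small integer grid, an assumption that happens to suffice for this surface. It is not a general method for locating singularities.

```python
from itertools import product


def cayley(x, y, z):
    """The Cayley cubic in the form quoted in the text."""
    return x * y * z + x * x + y * y + z * z - 4


def gradient(x, y, z):
    """Partial derivatives of the Cayley cubic with respect to x, y, z."""
    return (y * z + 2 * x, x * z + 2 * y, x * y + 2 * z)


def hessian_det(x, y, z):
    """Determinant of the Hessian [[2, z, y], [z, 2, x], [y, x, 2]]."""
    return 8 + 2 * x * y * z - 2 * (x * x + y * y + z * z)


# Points of the surface with small integer coordinates where the gradient vanishes.
singular = [p for p in product(range(-3, 4), repeat=3)
            if cayley(*p) == 0 and gradient(*p) == (0, 0, 0)]
for p in singular:
    print(p, hessian_det(*p))
# (-2, -2, -2) -32
# (-2, 2, 2) -32
# (2, -2, 2) -32
# (2, 2, -2) -32
```

The search returns four points, each with a nondegenerate Hessian, consistent with the four A1 singularities in the notation S(4A1).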
For a three-generator group fp, the factors of G(X) are hypersurfaces in seven-dimensional space of the form Sa,b,c,d(x,y,z). Some of them belong to the Fricke family,19 which is associated with the four-punctured sphere. For a chosen set of parameters a,b,c,d, however, the hypersurface reduces to an ordinary surface in three-dimensional space. For a four-generator group fp, the factors of G(X) are hypersurfaces in 14-dimensional space containing four copies of the form S(x,y,z), S(x,u,v), S(y,u,v), and S(z,v,w) for selected choices of eight parameters.
Groebner basis of the TATA box
The Groebner basis for the character variety associated with the fp group with relation rel = TATAAAA of the TATA box, as discussed above, is found to be:
GTATA = (z⁴ − xy² − xyz + x² + y² + yz − 3z² + x − 2) (x²z − xy − xz + y − z) S(A2) S(A4) (x³ − z² − 3x + 2),
where S(A2) = x²y − z³ − xz − y + 3z and S(A4) = xz² − x² − yz − x + 2 are degree 3 Del Pezzo surfaces. The Groebner basis GTATA comprises a degree 2 Del Pezzo surface (Fig. 1a) and a rational scroll, whose analytic expressions are the first two factors. Both surfaces are singular. The next two factors are surfaces with simple singularities of type A2 and A4, respectively. The last factor represents a curve (not a surface).
Groebner basis for polyadenylation signals
For the first polyadenylation signal considered in the section on infinite finitely generated groups, the relation of the fp group is rel1 = AAUAAA. The corresponding Groebner basis is:
Grel1 = 3 rational scrolls × P2 × S(4A1)S(A1) × curve.
The Groebner basis Grel1 contains three rational scrolls, a surface birationally equivalent to the projective plane P2, the Cayley cubic S(4A1), the degree 3 Del Pezzo surface S(A1) = x²y − xz² − xz + yz + x − y (Fig. 1b), and a curve.
For the second polyadenylation signal considered above in the paragraph describing groups fp and Fr, the relation of the fp group is rel2 = UGUAA. The factors of G(X) are seven-dimensional hypersurfaces Sa,b,c,d(x,y,z). However, by choosing specific parameters, such as S0,0,0,0(x,y,z) or S1,1,1,1(x,y,z), we obtained three-dimensional surfaces. These were found to be degree 3 Del Pezzo surfaces with simple singularities of the form S(lA2), with l = 1, 2, or 3, quadrics, or curves.
Groebner basis of the transcription factor of the DBX gene
For the DBX gene studied in the section on aperiodic sequences, the Groebner basis takes the form GDBX = scroll × P2 × S(A4) × S(A2) × S(4A1) × curve, where scroll = y²z − xy − yz + x − z and P2 = z⁴ − x²y + xz − 4z² + y + 2 are singular. The other surface factors are degree 3 Del Pezzo (DP3) surfaces with isolated singularities, namely S(A4) = yz² − y² − xz − y2, S(A2) = z³ − xy² + yz + − 3z, and the Cayley cubic S(4A1). The last factor is curve = y³ − z² − 3y + 2.