Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
54 commits
Select commit Hold shift + click to select a range
78cf991
added bucherna data for screwing up
jgluck Nov 10, 2013
85d3c7f
Added breakpoint_indices.py
Nov 10, 2013
573ff09
Merge branch 'master' of http://github.com/jgluck/assembly-testing
Nov 10, 2013
3a34850
adding structure
jgluck Nov 10, 2013
8336f88
Merge branch 'master' of github.com:jgluck/assembly-testing
jgluck Nov 10, 2013
4ad51a4
added
jgluck Nov 10, 2013
54abcac
Added readAssembly function
Nov 10, 2013
fde0702
added some more lengths of sequences
jgluck Nov 10, 2013
1509f41
Merge branch 'master' of github.com:jgluck/assembly-testing
jgluck Nov 10, 2013
59fc9c6
added a little readme about the sequences
jgluck Nov 10, 2013
5f00fe7
Added readSingletons method
Nov 10, 2013
0d30a5a
Merge branch 'master' of http://github.com/jgluck/assembly-testing
Nov 10, 2013
2e8d711
Added matchSingleton function.
Nov 10, 2013
e62db99
Added naiveBreakpointDetect
Nov 10, 2013
2983906
Updated opening comments
Nov 10, 2013
f6dbfd5
moved folders all around
jgluck Nov 10, 2013
a694f0b
fixed up script
jgluck Nov 10, 2013
d793d29
runOurs now takes an alpha in $1
jgluck Nov 10, 2013
14b0865
Added convertFASTQ.py
Nov 10, 2013
77cf933
Merge branch 'master' of http://github.com/jgluck/assembly-testing
Nov 10, 2013
c94288f
error gen code for inversions
Nov 10, 2013
9db375e
Merge branch 'master' of https://github.com/jgluck/assembly-testing
Nov 10, 2013
5e2047d
455PM
Nov 10, 2013
7f48909
made changes to errorGen, uses opt parse now
jgluck Nov 11, 2013
a9c2807
Added error code for rearrangement
Nov 11, 2013
9e281cc
Alpha too long bugfix
Nov 11, 2013
368eb57
Temporary: changed header handling
Nov 11, 2013
696d322
index bugfix
Nov 11, 2013
27b2159
bugfix
Nov 11, 2013
74db182
First testcase
jgluck Nov 11, 2013
1426978
got rid of some files
jgluck Nov 11, 2013
1d307f8
added output to breakpoint-detection
Nov 11, 2013
518f023
added output to breakpoint-detection
Nov 11, 2013
10e4043
Merge branch 'master' of github.com:jgluck/assembly-testing
jgluck Nov 11, 2013
f0658a4
bugfix
Nov 11, 2013
8920e3c
added 3 testcases and cleanup files
jgluck Nov 11, 2013
40dd984
readme for test cases
Nov 11, 2013
a801177
added influenza test
jgluck Nov 12, 2013
27683e7
Preparing for fork pull request
jgluck Nov 12, 2013
3f436c8
Updated directory structure, still not finished
Dec 10, 2013
2b92151
changed errorgen
jgluck Dec 10, 2013
349f4da
added windowed detection of errors
Dec 10, 2013
706c12c
Added windowed and break-match in error detector
Dec 10, 2013
a34f47a
merge!Merge branch 'master' of https://github.com/jgluck/assembly-tes…
Dec 10, 2013
b837222
moved a bunch of stuff
jgluck Dec 10, 2013
080e6d1
type
Dec 10, 2013
0afc07e
added a bunch more test cases
jgluck Dec 10, 2013
d23a57f
Added verifier
Dec 10, 2013
ce5bb53
Merge branch 'master' of github.com:jgluck/assembly-testing
jgluck Dec 10, 2013
a1e4695
pushing for testcases
jgluck Dec 10, 2013
071d454
created test case generator
Dec 10, 2013
c11a1a2
did some things
jgluck Dec 10, 2013
96e2723
tests
Dec 10, 2013
630063a
Merge branch 'master' of https://github.com/jgluck/assembly-testing
Dec 10, 2013
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions breakpoint-detection/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
Breakpoint Analysis
===

In order to run our code you will require:

1. Set of contigs to run your reads against (this will be called the assembly)
2. Set of unaligned reads (this is generated by bowtie)


You run our program as follows:

python breakpoint_indices.py -a <assembly file>
-u <unaligned reads file> -o <output for detected errors>
--alpha <alpha>

The detected errors will be spilled into the output file. The alpha determines the legngth of the string at the front of the sequence we will attempt to match.

For more examples of how to run our program, look at the test scripts. They're informative and well commented.

To be continued
---
dun dun dun!
275 changes: 275 additions & 0 deletions breakpoint-detection/breakpoint_indices.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,275 @@
"""
Input: an assembly fasta file and a set of unaligned reads in fasta format.
Input: alpha, window size at the beginning of the unaligned reads.
Output: a distribution over the indices of the assembly where errors are likely

Example usage (from inside tutorials directory):
python breakpoint_indices.py -a ../../data/influenza-A/influenza-A.assembly.fasta -u unaligned.txt --alpha 5
"""

import sys
from optparse import OptionParser
from math import sqrt, pi, exp


#reads an assembly from a FASTA file as one contiguous string. Returns said string.
def readAssembly(assembly_file):
infile = open(assembly_file,'r')

assembly=[]
headerCount = 0
for line in infile:
#if this line is a header line, skip it
if line[0]==">":
headerCount+=1
continue
else:
assembly.append(line)


print 'Read %d headers' % headerCount

assemblyString = "".join(assembly)
strippedString = assemblyString.strip().replace('\n','')
return strippedString

#reads in a list of singletons from a FASTA file. Returns list of strings.
def readSingletons(unaligned_file):
infile = open(unaligned_file,'r')

singletons=[]
currentSingleton=[]
singletonCount=0
for line in infile:
if line[0]==">":
singletonCount+=1

#if the current singleton is not the empty list, join it and add it to the singletons list
if len(currentSingleton)!=0:
singletonString = "".join(currentSingleton)
strippedString = singletonString.strip().replace('\n','')
singletons.append(strippedString)

currentSingleton=[]
continue
else:
currentSingleton.append(line)

#get last singleton, since there are no more header lines
singletonString = "".join(currentSingleton)
strippedString = singletonString.strip().replace('\n','')
singletons.append(strippedString)

return singletons


# given a singleton, an assembly, and a windowsize alpha
# returns a list of indices where the singleton matches the assembly

def matchSingleton(singleton,assembly,alpha):
#ensure alpha is a small window
if alpha >= len(singleton):
print 'Alpha = '+str(alpha)+' is greater than length of singleton '+singleton
sys.exit()

#get alpha string
alphaString = singleton[0:alpha]

#result set of indices
indices=[]

resultIndex = 0
currentStart=0
while resultIndex != -1:
resultIndex = assembly.find(alphaString,currentStart,len(assembly))
if resultIndex != -1:
print 'Found match at position: '+str(resultIndex)
#now find where singleton stops matching assembly
curInd = resultIndex
for char in singleton:
if char == assembly[curInd]:
curInd+=1
print 'matches up to %d' % curInd
else:
print 'distance: %d' % (curInd - resultIndex)
break

indices.append(curInd)
currentStart=resultIndex+1

return indices

# given a list of singletons, an assembly, and a windowsize alpha
# returns an array where each index represents the number of singletons that match at that point
# with that alpha
def naiveBreakpointDetect(singletons,assembly,alpha,outputFile=None):
#initialize empty array
matchArray = []
for i in range(len(assembly)):
matchArray.append(0)

#for each singleton, increment matchArray everywhere a singleton matches
i=1
for s in singletons:
print 'Processing singleton ' + str(i) + ' of ' + str(len(singletons))
matchIndices = matchSingleton(s,assembly,alpha)
for index in matchIndices:
matchArray[index]+=1
i+=1

if outputFile != None:
outputStream = open(outputFile,'w')


errorCount = 0
for i in range(len(matchArray)):
if matchArray[i] > 0:
outputStream.write("error_"+str(errorCount)+"\t"+str(i)+"\t"+str(i)+"\tbreak\t"+str(matchArray[i])+"\n")
errorCount+=1

outputStream.close()


return matchArray

# generates a gaussian filter of size n with stdev 1
def gauss(n,sigma):
r = range(-int(n/2),int(n/2)+1)
return [1 / (sigma * sqrt(2*pi)) * exp(-float(x)**2/(2*sigma**2)) for x in r]

# given an array of numbers, an index at which to add the gaussian,
# the size of the window, and the stdev of the gaussian
# returns the array with the gaussian added
def addGaussian(arr,index,n,sigma):
gaussian = gauss(n,sigma)
startInd = index - (int(n/2))
endInd = index + (int(n/2))

gaussianCenter = int(n/2)

for i in range(gaussianCenter):
distance = gaussianCenter - i
arrayIndex = index-distance
if arrayIndex > 0 and arrayIndex < len(arr):
arr[arrayIndex]+=gaussian[i]

for i in range(gaussianCenter,len(gaussian)):
distance = gaussianCenter - i
arrayIndex = index-distance
if arrayIndex > 0 and arrayIndex < len(arr):
arr[arrayIndex]+=gaussian[i]

return arr


# given a list of singletons, an assembly, and a windowsize alpha
# returns an array where each index represents the number of singletons that match at that point
# with that alpha
def gaussianBreakpointDetect(singletons,assembly,alpha,n,sigma,outputFile=None):
try:
n = int(n)
except:
print 'n was: '+str(n)
raw_input()
sys.exit()

#initialize empty array
matchArray = []
for i in range(len(assembly)):
matchArray.append(0)

#for each singleton, increment matchArray everywhere a singleton matches
i=1
for s in singletons:
print 'Processing singleton ' + str(i) + ' of ' + str(len(singletons))
matchIndices = matchSingleton(s,assembly,alpha)
for index in matchIndices:
addGaussian(matchArray,index,n,sigma)
i+=1

if outputFile != None:
outputStream = open(outputFile,'w')

outputStream.write("ContigName\tStartError\tEndError\ttype\tconfidence\n")

errorCount = 0
for i in range(len(matchArray)):
if matchArray[i] > 0:
outputStream.write("error_"+str(errorCount)+"\t"+str(i)+"\t"+str(i)+"\tbreak\t"+str(matchArray[i])+"\n")
errorCount+=1

outputStream.close()


return matchArray




def Main():
parser = OptionParser()
parser.add_option("--algorithm",dest = "algorithm")
parser.add_option("-a","--assembly-file", dest = "assembly_file")
parser.add_option("-u","--unaligned-file",dest = "unaligned_file")
parser.add_option("--alpha",dest = "alpha")
parser.add_option("-o","--output-file",dest = "output_file")
parser.add_option("--gauss-window",dest = "gauss_window")
parser.add_option("--sigma",dest = "sigma")

(options, args) = parser.parse_args()

if not options.assembly_file:
print 'Provide assembly file (-a, --assembly-file)'
sys.exit()

if not options.unaligned_file:
print 'Provide unaligned reads file (-u, --unaligned-file)'
sys.exit()

if not options.alpha:
print 'Provide alpha (--alpha)'
sys.exit()

output_file=None
if options.output_file:
output_file=options.output_file

alpha = int(options.alpha)

print 'Using alpha of: '+str(alpha)

#read assembly
assemblyString = readAssembly(options.assembly_file)

print 'Successfully read assembly of length %d' % len(assemblyString)

#read singletons
singletons = readSingletons(options.unaligned_file)

print 'Successfully read %d singletons' % len(singletons)


#match singletons to assembly
if options.algorithm == "naive":
matchArray = naiveBreakpointDetect(singletons,assemblyString,alpha,output_file)
elif options.algorithm == "window":
if not options.gauss_window:
print 'Provide --gauss-window'
sys.exit()
if not options.sigma:
print 'Provide --sigma'
sys.exit()

n = int(options.gauss_window)
sigma = float(options.sigma)

matchArray = gaussianBreakpointDetect(singletons,assemblyString,alpha,n,sigma,output_file)
else:
print '--algorithm must be naive or window'
print options.algorithm
sys.exit()



if __name__=="__main__":
Main()
4 changes: 4 additions & 0 deletions breakpoint-detection/cleanUp.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
#!/bin/bash

rm -f inf*
rm -f una*
30 changes: 30 additions & 0 deletions breakpoint-detection/generate_unaligned.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
"""
Generates a file from an assembly file which is the unaligned reads from bowtie2.
"""

import sys
import subprocess

if (len(sys.argv) < 4):
print 'Provide an input assembly file, a reads file, and an output file.'
sys.exit()

assembly_file = sys.argv[1]
reads_file = sys.argv[2]
output_file = sys.argv[3]

# First build the index that bowtie2 will use the align the reads.
execString = "bowtie2-build <ASSEMBLY_FILE> assemblyIndex.bt2"
execString = execString.replace("<ASSEMBLY_FILE>",assembly_file)

subprocess.call(execString,shell=True)

# Run bowtie2 to get the alignments.
execString = "bowtie2 --un <OUTPUT_FILE> -x assemblyIndex.bt2 -f -U <READS_FILE> -S output.sam"
execString = execString.replace("<READS_FILE>",reads_file)
execString = execString.replace("<OUTPUT_FILE>",output_file)

subprocess.call(execString,shell=True)

execString = "rm -f *assemblyIndex*"
subprocess.call(execString,shell=True)
9 changes: 9 additions & 0 deletions breakpoint-detection/generate_unaligned.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
#!/bin/bash

# First build the index that bowtie2 will use the align the reads.
bowtie2-build ../data/influenza-A/influenza-A.assembly.fasta influenza-A.bt2

# Run bowtie2 to get the alignments.
bowtie2 --un unaligned.txt -x influenza-A.bt2 -f -U ../data/influenza-A/influenza-A.sequences.fasta -S influenza-A.sam

rm -f inf*
12 changes: 12 additions & 0 deletions breakpoint-detection/generate_unalignedtxt.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
#!/bin/bash

# First build the index that bowtie2 will use the align the reads.
bowtie2-build ../data/influenza-A/influenza-A.assembly.fasta influenza-A.bt2

# Run bowtie2 to get the alignments.
bowtie2 --un unaligned.txt -x influenza-A.bt2 -f -U ../data/influenza-A/influenza-A.sequences.fasta -S influenza-A.sam

rm -f inf*

# Quick and dirty command to get the starting point and binning it in 100bp segments.
#awk '{print int($4/100)}' influenza-A.sam | sort -nr | uniq -c > influenza-A.cov
Loading