
Importing data from a text file

The problem

We have been given a text file with some survey data in it.  While Muk3D can easily read text files, it generally expects the first 3 columns to be x, y, z (values can be separated by spaces, commas, or tabs).

In this example file, the first column is a point classification, the second column is y, the third is z, and the fourth is x.  In order to import this file, we will need to rearrange the columns and create a new text file that can be imported.

Opening a file

A file can be opened for reading or writing by using the Python function open.

with open('data.txt', 'rt') as filehandle: 
    print filehandle.readline()

The first line of this code opens the file and assigns it to the variable filehandle.  The variable filehandle is just a handle that allows us to access the data in the file.  The ‘rt’ argument opens the file for reading (r) and tells open that it is a text file (t).

When we run this script, a single line of data will be written to the output window:

Point classification,y,z,x

This is the header row from the text file.

We could have just opened the file using the syntax:

filehandle = open('data.txt', 'rt') 
# do stuff with the filehandle 
filehandle.close() 

The with statement will close the file once all the indented code beneath it has been executed.  The second method, where the result of the open function is assigned directly to the variable filehandle, will not automatically close the file, so if we adopt this approach we have to explicitly close the file when we are done with it using the method close().

The next line of the script prints a line of data read from the file.  There are a number of ways we can read text from the file:
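The difference between the two approaches can be checked with the closed attribute of a file handle. The following is a minimal sketch, not part of the tutorial script: it writes a throwaway file to a temporary folder (an assumption made so that it does not depend on data.txt) and calls print with a single argument so that it should run under both Python 2.7 and Python 3.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on data.txt.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wt') as filehandle:
    filehandle.write('Point classification,y,z,x\n')

# Inside the with block the file is open; afterwards it is closed for us.
with open(path, 'rt') as filehandle:
    open_inside_with = not filehandle.closed
closed_after_with = filehandle.closed

# Without with, the file stays open until close() is called explicitly.
filehandle = open(path, 'rt')
open_before_close = not filehandle.closed
filehandle.close()
closed_after_close = filehandle.closed

print('{} {} {} {}'.format(open_inside_with, closed_after_with,
                           open_before_close, closed_after_close))
```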

all_data = filehandle.read()
one_line = filehandle.readline()
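As a rough illustration of how these calls differ, the sketch below uses io.StringIO as a stand-in for the handle returned by open (an assumption, so the snippet runs without data.txt). It also shows readlines, a related method that returns every line as a list.

```python
import io

# Assumption: io.StringIO stands in for the handle returned by open(),
# so the sketch runs without data.txt. It offers the same read methods.
sample_text = (u'Point classification,y,z,x\n'
               u'8,-1933.38,83.006,5289.185\n'
               u'8,-4544.083,98.075,3989.627\n')

filehandle = io.StringIO(sample_text)
all_data = filehandle.read()          # the whole file as one string

filehandle = io.StringIO(sample_text)
one_line = filehandle.readline()      # only the first line, newline included
second_line = filehandle.readline()   # each further call returns the next line

filehandle = io.StringIO(sample_text)
all_lines = filehandle.readlines()    # a list with one string per line

print(repr(one_line))
print(len(all_lines))
```

For a file with a million rows, read and readlines pull everything into memory at once, which is one reason to prefer reading line by line.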

Looping through all the rows

Since we want to pull in all values, we need to loop through all lines in the file.  We can do this using a for loop.  The filehandle variable supports iteration by line, which means that we can step through the file line by line in a for loop.

with open('data.txt', 'rt') as filehandle:    
    for line in filehandle:  
        print line  
        break  

We’ve added the keyword break here so that, for now, the script doesn’t loop over the whole file.  Without it, every time we execute the script it would loop through each line in the file and print it to the output.  Our example file has 1 million points, so that would mean printing a million rows.

When this is run, we get the same output as before (the header row) since we’re just reading out the first row, and then break ends the for loop.

Rather than break after the first line has been read and printed from the text file, we can use a counter to end the for loop after the 5th row has been printed.

A variable called counter is initialised to 0 above the for loop.  After each line is printed, the counter is incremented by 1 and checked to see if its value is 5.  If it is 5, then the break keyword ends the for loop.

with open('data.txt', 'rt') as filehandle:    
    counter = 0
    for line in filehandle:  
        print line  
        counter += 1
        # could also increment by:
        # counter = counter + 1
        if counter == 5:
            break

When we check to see if values are equal in Python we use ==.  A single = is the assignment operator; using it inside an if statement is not allowed, and Python raises a SyntaxError before the script runs.
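A small sketch of the difference between the two operators. The call to compile is used here only so that the SyntaxError can be caught and shown instead of stopping the snippet; it is not something the tutorial script needs.

```python
counter = 5                 # a single = assigns the value 5 to counter
is_five = (counter == 5)    # == compares and gives True or False

# A single = inside an if statement is rejected when the code is compiled.
# compile() is used only so the error can be caught and inspected.
try:
    compile('if counter = 5:\n    pass\n', '<example>', 'exec')
    raised_syntax_error = False
except SyntaxError:
    raised_syntax_error = True

print('{} {}'.format(is_five, raised_syntax_error))
```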

When the script with the counter is run, it should print 5 rows of data.

Point classification,y,z,x
8,-1933.38,83.006,5289.185
8,-4544.083,98.075,3989.627
7,-4224.925,97.55,3991.804
4,-2949.443,97.533,4646.521
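As an aside, the standard library offers another way to stop after a fixed number of lines. The sketch below uses itertools.islice in place of the counter and break; io.StringIO again stands in for the opened data.txt (an assumption so the snippet runs on its own). The counter approach used in this tutorial works just as well.

```python
import io
from itertools import islice

# Assumption: io.StringIO stands in for the opened data.txt.
filehandle = io.StringIO(
    u'Point classification,y,z,x\n'
    u'8,-1933.38,83.006,5289.185\n'
    u'8,-4544.083,98.075,3989.627\n'
    u'7,-4224.925,97.55,3991.804\n'
)

# islice stops the iteration after the requested number of lines,
# so no counter variable or break is needed.
first_lines = []
for line in islice(filehandle, 2):
    first_lines.append(line)

print(len(first_lines))
```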

Splitting the text

The variable line in the code above is assigned the text of each line in turn as the loop steps through the file.  The string class (of which line is an instance) in Python has a method called split that we can use to break the single line of text into a list by specifying the delimiter, in this case, a comma.

with open('data.txt', 'rt') as filehandle:    
    counter = 0
    for line in filehandle:  
        values = line.split(',')
        print values
        
        counter += 1
        
        if counter == 5:
            break

The variable values is being assigned a list of the row entries using a comma as the separation point.  The following output is generated.

['Point classification', 'y', 'z', 'x\n']
['8', '-1933.38', '83.006', '5289.185\n']
['8', '-4544.083', '98.075', '3989.627\n']
['7', '-4224.925', '97.55', '3991.804\n']
['4', '-2949.443', '97.533', '4646.521\n']

Firstly, instead of a single string, each row is now printed as a list (a list is shown inside [ ]).  The values are printed in quotes (‘’) so we know that they are strings.  Finally, the last entry has \n tacked on to it.  What is that?

\n is the newline character (a line feed).  Think about a typewriter.  Once you get to the end of the line you can either move down to the next line (line feed, LF), go back to the start of the current line (carriage return, CR), or both (CRLF).

In this case, because all the entries in the text file are on separate lines, there is a hidden line-ending character at the end of each one.  In Notepad++, if you turn on the option that shows formatting characters, you’ll see CRLF at the end of each line.  When the file is opened in text mode on Windows, Python presents that line ending as the single character \n, which is what we’re seeing here.

We need to get rid of it.  Even though it’s invisible, it’s still a character (ASCII code 10), and stray whitespace like this can cause trouble later, for example when the text is compared or written back out.

Fortunately, the string class has a method called strip that will strip characters out of a string.  We may as well do this to the line variable before we split it.

with open('data.txt', 'rt') as filehandle:    
    counter = 0
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        print values
        
        counter += 1
        
        if counter == 5:
            break

Note that it’s a backslash preceding the n. If you use a forward slash, you’ll get a completely different result. In the code, it is now cleaned_line that is split instead of line. When we run this, we get the following output:

['Point classification', 'y', 'z', 'x']
['8', '-1933.38', '83.006', '5289.185']
['8', '-4544.083', '98.075', '3989.627']
['7', '-4224.925', '97.55', '3991.804']
['4', '-2949.443', '97.533', '4646.521']
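
The effect of strip and split on a single line can be tried out in isolation. This is a minimal sketch using one row from the example data; the line with a Windows-style \r\n ending is an assumption added to show what strip does when it is called with no argument.

```python
line = '8,-1933.38,83.006,5289.185\n'

# Splitting first leaves the newline attached to the last value.
raw_values = line.split(',')

# strip('\n') removes newline characters from both ends of the string.
cleaned_values = line.strip('\n').split(',')

# strip() with no argument removes any leading or trailing whitespace,
# which also covers a stray '\r' from Windows line endings.
windows_line = '8,-1933.38,83.006,5289.185\r\n'
windows_values = windows_line.strip().split(',')

newline_code = ord('\n')  # the numeric character code of the newline

print(raw_values[-1] == '5289.185\n')
print(cleaned_values)
```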

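To skip the header row, we can read it with a single call to readline before the loop starts, so that the for loop begins at the first row of data.

with open('data.txt', 'rt') as filehandle: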
    filehandle.readline()
    counter = 0
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        print values
        
        counter += 1
        
        if counter == 5:
            break

The output looks similar to what we had before: 5 rows, but the first row now contains numbers rather than the header.

['8', '-1933.38', '83.006', '5289.185']
['8', '-4544.083', '98.075', '3989.627']
['7', '-4224.925', '97.55', '3991.804']
['4', '-2949.443', '97.533', '4646.521']
['3', '-4413.829', '110.263', '1381.409']

Converting the text values to numbers

So far we can split the data in each line, but currently all the values are strings (text).  These need to be converted to numbers.  The variable values  is assigned a list of the text in each line.  To access a value in a list we use the [ ]  operator.

The columns in this file are ordered:

  1. Point classification
  2. y coordinate
  3. z coordinate
  4. x coordinate

When we get values we need to get the correct element.  In Python, lists are zero indexed.  This means that the first element is referenced as element 0, the second as element 1, etc.  So in this instance the x coordinate is element 3, y is element 1, z is element 2, and point classification is element 0.

with open('data.txt', 'rt') as filehandle:    
    filehandle.readline()
    counter = 0
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = values[3]
        y = values[1]
        z = values[2]
        classification = values[0]
        
        print x, y, z
        counter += 1
        
        if counter == 5:
            break

Here the x, y, z values are printed using a single print statement.  When several values are passed to print separated by commas, they are printed on one line with a space between each value.

5289.185 -1933.38 83.006
3989.627 -4544.083 98.075
3991.804 -4224.925 97.55
4646.521 -2949.443 97.533
1381.409 -4413.829 110.263
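
Zero-based indexing can be tried on a single row, as in the sketch below. The negative index and the IndexError check are additions for illustration and are not used in the tutorial script.

```python
values = ['8', '-1933.38', '83.006', '5289.185']

classification = values[0]   # first element
y = values[1]
z = values[2]
x = values[3]                # fourth element, the last one in this row
also_x = values[-1]          # negative indexes count back from the end

# Asking for an element past the end of the list raises an IndexError.
try:
    values[4]
    index_error_raised = False
except IndexError:
    index_error_raised = True

print('{} {} {}'.format(x, y, z))
```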

The values for x, y, z, and Point classification are still represented as strings, so the next step is to convert into numbers.

We know that the coordinate values are going to be floating point (decimal) numbers and in Python they are known as the type float.  The classification column is an integer (whole number, no decimals) and in Python they are known as type int.

To convert a string to a float, we can use the built-in function float.  For integers, the corresponding built-in function is int.

with open('data.txt', 'rt') as filehandle:    
    filehandle.readline()
    counter = 0
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        print x, y, z
        counter += 1
        
        if counter == 5:
            break

Now the variables x, y, & z contain floating point values, and the classification variable contains an integer.
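A short sketch of why the conversion matters and what happens when the text is not a number. The values are taken from the example rows; the failing case uses the header text, which is one reason the header row is skipped before the loop.

```python
x = float('5289.185')
classification = int('8')

# With strings, + joins the text; with numbers, it adds.
text_sum = '1' + '2'                # gives the string '12'
number_sum = int('1') + int('2')    # gives the integer 3

# Text that is not a number raises a ValueError, e.g. a header cell.
try:
    float('Point classification')
    conversion_failed = False
except ValueError:
    conversion_failed = True

print('{} {}'.format(text_sum, number_sum))
```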

Saving to a file

Now that we can go through and get the numerical values, we are going to write that data back to a file so that we can just import it into Muk3D.  There are two things we’re going to do here.  The first is to open another file handle, but this time for output (the ‘wt’ argument opens a text file for writing).  A single with statement can open several files and will automatically close all of them at the conclusion of the indented code that follows it.

The \ at the end of the first line lets us continue the with statement on the second line, just for readability.  If we didn’t use the \, Python would raise a SyntaxError, which indicates that the source code is improperly formatted.

The second is that we have changed the print statement to a call to write that takes the x, y, and z values, formats them as a string representation of floats, and then writes them as comma separated data.

with open('data.txt', 'rt') as filehandle, \
    open('output.txt', 'wt') as output_filehandle:    
    
    filehandle.readline()
    counter = 0
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        output_filehandle.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
        
        counter += 1
        
        if counter == 5:
            break

The data is written to the text file in line 15 above. The method write of the output_filehandle will take the text passed as an argument and write it into the file.  The text in this case uses Python’s new-style string formatting to insert the x, y, and z values into a string to be written to the text file.  The {} are placeholders that will be filled with the values passed as arguments to the format method of the string class.  Because there are 3 placeholders, the format method needs to have 3 values passed to it.  Within each {}, the text :f tells the format method that the argument should be written as a fixed-point decimal number.  By default this writes six digits after the decimal point.

The three place holders are comma separated so that once the text is written it can be loaded as a comma separated file.  The line of text ends with a newline character, \n, which ensures that the next text written to this file will be on a new line.  If we omitted this, then all the values would be written on a single line and be very difficult to parse later.
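The format string can be tried on its own with one row of the example data. The second variant, which sets the number of decimal places with .3f, is an addition for illustration and is not used in the tutorial script.

```python
x, y, z = 5289.185, -1933.38, 83.006

# :f on its own writes six digits after the decimal point.
default_row = '{:f}, {:f}, {:f}, \n'.format(x, y, z)
# default_row is '5289.185000, -1933.380000, 83.006000, \n'

# A precision can be given after the colon, here three decimal places.
short_row = '{:.3f}, {:.3f}, {:.3f}, \n'.format(x, y, z)
# short_row is '5289.185, -1933.380, 83.006, \n'

print(repr(default_row))
print(repr(short_row))
```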

Adding a header row to the text file

So that we don’t get confused when looking at the text file later, we’ll add a header row to our output file.  When you’re importing XYZ data into Muk3D the header row will be ignored.

with open('data.txt', 'rt') as filehandle, \
    open('output.txt', 'wt') as output_filehandle:    
    
    output_filehandle.write('x, y, z, \n')
    filehandle.readline()
    counter = 0
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        output_filehandle.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
        
        counter += 1
        
        if counter == 5:
            break

Process the entire file

Once we’re happy that the data is being written correctly to the file, we can remove the counter and the break statement to let the script process the entire file.

with open('data.txt', 'rt') as filehandle, \
    open('output.txt', 'wt') as output_filehandle:    
    
    output_filehandle.write('x, y, z, \n')
    filehandle.readline()
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        output_filehandle.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))

Once you run the command, you can drag the output.txt file into Muk3D and the point data should load.
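Before running the script on the full file, it can be reassuring to check it end to end on a tiny sample. The sketch below wraps the same steps in a function and runs it on a two-row file written to a temporary folder. The function name reorder_columns, the row counter, and the temporary paths are hypothetical choices made for this illustration.

```python
import os
import tempfile

def reorder_columns(input_path, output_path):
    """Copy classification,y,z,x rows to x,y,z rows; return the row count."""
    rows_written = 0
    with open(input_path, 'rt') as filehandle, \
        open(output_path, 'wt') as output_filehandle:
        output_filehandle.write('x, y, z, \n')
        filehandle.readline()  # skip the header row
        for line in filehandle:
            values = line.strip('\n').split(',')
            x = float(values[3])
            y = float(values[1])
            z = float(values[2])
            output_filehandle.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))
            rows_written += 1
    return rows_written

# Build a two-row sample file in a temporary folder and convert it.
folder = tempfile.mkdtemp()
input_path = os.path.join(folder, 'data.txt')
output_path = os.path.join(folder, 'output.txt')
with open(input_path, 'wt') as sample:
    sample.write('Point classification,y,z,x\n')
    sample.write('8,-1933.38,83.006,5289.185\n')
    sample.write('7,-4224.925,97.55,3991.804\n')

rows_written = reorder_columns(input_path, output_path)
with open(output_path, 'rt') as result:
    output_lines = result.readlines()
print(rows_written)
```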

Filtering values by classification

The first value in each row of our input data represented a point classification, ranging between 0 and 10.  The classification in this example is completely arbitrary but if we’re looking at a dataset such as LIDAR, each point might be classified based on what the point return represents (ground, vegetation, water, etc).

In this first step we’ll import only points with a classification of 5.  A new variable has been created to represent our target_classification.  The output text file has been renamed to output-class-5.txt.  In row 17, we check the classification of the current point and only write it to our output file if it matches the target classification.

target_classification = 5

with open('data.txt', 'rt') as filehandle, \
    open('output-class-5.txt', 'wt') as output_filehandle:    
    
    output_filehandle.write('x, y, z, \n')
    filehandle.readline()
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        if classification == target_classification:
            output_filehandle.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
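The filtering logic can be tested separately from the file handling. This is a sketch under stated assumptions: the function name filter_rows is hypothetical, it works on a list of lines instead of a file, and the two rows with classification 5 are made-up values, since none of the example rows shown earlier have that classification.

```python
def filter_rows(lines, target_classification):
    """Return formatted x, y, z rows for points in the target class."""
    output_rows = []
    for line in lines:
        values = line.strip('\n').split(',')
        classification = int(values[0])
        if classification == target_classification:
            x = float(values[3])
            y = float(values[1])
            z = float(values[2])
            output_rows.append('{:f}, {:f}, {:f}, \n'.format(x, y, z))
    return output_rows

# Columns are classification, y, z, x. The class 5 rows are invented.
sample_lines = [
    '8,-1933.38,83.006,5289.185\n',
    '5,-100.0,10.0,200.0\n',
    '7,-4224.925,97.55,3991.804\n',
    '5,-150.5,12.25,210.75\n',
]
class_5_rows = filter_rows(sample_lines, 5)
print(len(class_5_rows))
```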

Splitting the file into each classification

In the example above we’ve just pulled in points with a classification of 5.  If we want to bring in all data, but isolate each point classification, we can create output files for classifications 0 – 10 and then write to the appropriate file when a point is processed.

One approach is outlined in the script below.  A file is opened to represent each point classification (rows 1-3) and the header row is written (rows 5-7).  Rows 21-26 test for membership in one of the output classes and write the data to the matching file. Rows 28-30 close each of the output file handles.  We have to do this explicitly since they were created using the open function and not in a with statement.

output0_fh =  open('output-class-0.txt', 'wt')
output1_fh =  open('output-class-1.txt', 'wt')
output2_fh =  open('output-class-2.txt', 'wt')

output0_fh.write('x, y, z, \n')
output1_fh.write('x, y, z, \n')
output2_fh.write('x, y, z, \n')

with open('data.txt', 'rt') as filehandle:    
    
    filehandle.readline()
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        if classification ==  0:
            output0_fh.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
        elif classification ==  1:
            output1_fh.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
        elif classification == 2:
            output2_fh.write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
        
output0_fh.close()
output1_fh.close()
output2_fh.close()

This example is only looking at 3 point classifications because there is a better way to approach this.  If we had 10 or 20 different classifications, then we’d have to manually create the file handle variables, add to the if/elif statement, and close the file handle.

Instead of explicitly defining each output file handle, we can use loops to create each output, then store the file handles in a list, indexed by point classification.

number_of_classifications = 11 # 0 - 10
file_handles = []

for classification_no in range(number_of_classifications):
    filename = 'output-class-{}.txt'.format(classification_no)
    fh = open(filename, 'wt')
    fh.write('x, y, z, \n')
    file_handles.append(fh)

with open('data.txt', 'rt') as filehandle:    
    filehandle.readline()
    for line in filehandle:  
        cleaned_line = line.strip('\n')
        values = cleaned_line.split(',')
        
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        
        file_handles[classification].write('{:f}, {:f}, {:f}, \n'.format(x, y, z))  
        
for fh in file_handles:
    fh.close()

In row 1, a variable holds the number of point classifications that we want to handle.  Row 2 is a list variable that will hold the file handles that are created.

In row 4 we create a for loop to loop through the number sequence 0 – 10 which is created by the range function.  In each step of the loop, the variable classification_no is assigned to the next value in the 0 – 10 sequence.

In row 5 the output file name is created.  The string format method is used to substitute the current value of classification_no into the filename.
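The sequence produced by range and the filenames built from it can be inspected without opening any files, as in this small sketch. Wrapping range in list is done only to make the sequence visible.

```python
number_of_classifications = 11  # 0 - 10

# range stops one short of its argument, so this gives 0 through 10.
classification_numbers = list(range(number_of_classifications))

filenames = []
for classification_no in classification_numbers:
    filenames.append('output-class-{}.txt'.format(classification_no))

print(filenames[0])
print(filenames[-1])
```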

In row 6 the file handle is opened using the filename. In row 7 the header row is written to the current file handle.

Finally, in row 8 the file handle is appended to the file_handles list.

In row 21 we access the appropriate file handle for the point classification and write the point data to it.  In this example this works well because the first point classification is 0 (remember, arrays are zero based).  In a later tutorial we’ll show how to use an arbitrary range or text values to index the file handles.

Finally in rows 23 and 24 we use a for loop to loop through each element in the file_handles array and then close that file handle.
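One possible refinement, offered as a sketch and not as part of the tutorial script: because these handles are not managed by a with statement, an error part way through (for example a row that cannot be converted) would leave them open. Wrapping the work in try/finally makes sure the closing loop always runs. The temporary folder, the three classifications, and the sample rows are assumptions made so the snippet is self-contained.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
number_of_classifications = 3
# Columns are classification, y, z, x. These rows are invented sample data.
sample_lines = ['0,1.0,2.0,3.0\n', '1,4.0,5.0,6.0\n',
                '1,7.0,8.0,9.0\n', '2,10.0,11.0,12.0\n']

file_handles = []
try:
    for classification_no in range(number_of_classifications):
        filename = os.path.join(
            folder, 'output-class-{}.txt'.format(classification_no))
        fh = open(filename, 'wt')
        fh.write('x, y, z, \n')
        file_handles.append(fh)

    for line in sample_lines:
        values = line.strip('\n').split(',')
        x = float(values[3])
        y = float(values[1])
        z = float(values[2])
        classification = int(values[0])
        file_handles[classification].write(
            '{:f}, {:f}, {:f}, \n'.format(x, y, z))
finally:
    # Runs whether or not an error occurred above.
    for fh in file_handles:
        fh.close()

all_closed = all(fh.closed for fh in file_handles)
with open(os.path.join(folder, 'output-class-1.txt'), 'rt') as check:
    class_1_lines = check.readlines()
print(len(class_1_lines))
```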
