Project 3: A CSV Parser
Reading time: 20 minutes
Difficulty Level: ★★★★★
Start early; Start Often!
Due date: April 23 23:59
Updates and Revisions
- [April 22 09:51] Updated walker pointer example. See Edstem post.
- [April 21 14:51] Added clarification of parse() input.
- [April 21 13:58] Added submission instructions.
- [April 19 15:56] Added sections on empty columns and testing.
Table of contents
Learning Goals
In this assignment, we will–
- learn about CSV files,
- start using pointers to access and modify data,
- practice more advanced string manipulation,
- learn how to parse command line arguments using
getopt(), - use
fgets()to read from files, - test our own code for correctness, and of course,
- practice using the terminal and vim some more.
Introduction
This time, we will be writing a program to read and parse CSv (comma-separated values) files.
CSV Files
This is a file format that uses comma-separated values to store data. You can simply think of a CSV file as a table with rows and columns. Each line in a CSV file represents a row, and the columns in each row are separated by a delimiter, which in the case of CSV files, is a comma (',').
Examples
Consider the following table:
| Name | Race | Age | Weapon |
|---|---|---|---|
| Frodo Baggins | Hobbit | 51 | Sting |
| Aragorn | Human | 87 | Anduril |
| Gandalf | Istari | Unknown | Staff |
| Samwise Gamgee | Hobbit | 38 | Cast iron pan |
To store this table using comma-separated values, we get:
Name,Race,Age,Weapon
Frodo Baggins,Hobbit,51,Sting
Aragorn,Human,87,Anduril
Gandalf,Istari,Unknown,Staff
Samwise Gamgee,Hobbit,38,Cast iron pan
Empty Columns
It is totally acceptable to have empty columns in a CVS file: There would simply be no value between two commas for an empty column:
Example:
| Name | Race | Age | Weapon |
|---|---|---|---|
| Gandalf | Istari | Staff | |
| Samwise Gamgee | Hobbit | 38 |
Converting the above table to csv, we get
Name,Race,Age,Weapon
Gandlaf,Istari,,Staff
Samwise Gamgee,Hobbit,38,
You might ask, what if some data cell contains the delimiter character? Consider a CSV file using commas to separate values. What if we want to store the following address?
6755 Mira Mesa Blvd Ste 111, San Diego, CA 92121If we store it directly, then it would be interpreted as three separated values.
To address this, values containing the delimiter character would have quotes around them (e.g.,
"Lorem, ipsum").However, in this PA, you can assume such conflicts will not happen.
The select program
Your task for this programming assignment is to create a command line program called select. This program reads CSV files from standard input (just like in PA 2) and selects desired columns.
The program is run like so:
Usage: ./select -c NCOLS [-n NRECS] COL1 COL2 COL3 ...
Let’s break it down–
./select: The name of the executable program.-c NCOLS: The number of columns to read from the input CSV.[-n NRECS]: Optional. The maximum number of rows to parse.COL1 COL2 COL3 ...: The indices of columns to output (1-indexed).
Usage Example
Consider the same CSV file from an earlier section. Let’s call this file lotr.csv:
Name,Race,Age,Weapon
Frodo Baggins,Hobbit,51,Sting
Aragorn,Human,87,Anduril
Gandalf,Istari,Unknown,Staff
Samwise Gamgee,Hobbit,38,Cast iron pan
To select the Name and Weapon columns, we can run the following:
$ ./select -c 4 1 4 < lotr.txt
Name,Weapon
Frodo Baggins,Sting
Aragorn,Anduril
Gandalf,Staff
Samwise Gamgee,Cast iron pan
- The
-c 4argument specifies there are 4 columns to read from the input CSV. - The subsequent
1 4indicate the columns we wish to select and output. < lotr.txt, as you will recall from last time, redirects the contents of thelotr.txtfile as input to theselectprogram. (This is a feature of the shell, not something you need to implement.)
The output column indexes can be repeated:
$ ./select -c 4 1 2 1 3 4 4 < lotr.txt Name,Race,Name,Age,Weapon,Weapon Frodo Baggins,Hobbit,Frodo Baggins,51,Sting,Sting Aragorn,Human,Aragorn,87,Anduril,Anduril Gandalf,Istari,Gandalf,Unknown,Staff,Staff Samwise Gamgee,Hobbit,Samwise Gamgee,38,Cast iron pan,Cast iron pan
Processing Command Line Arguments
The first task for you is to implement command line arguments parsing.
The Anatomy of Command Line Arguments
Generally speaking, there are three kinds of command line arguments. Consider the following compiling command to compile a simple C file hello.c into an executable program called hello, with debugging features turned on (-g):
$ gcc -o hello -g hello.c
In this command, we see 3 kinds of arguments:
- Flag Only: This would be like the
-gflag when we compile a program with gcc. No value is specified. - Flag + Value: This would be like the
-o fileflag when we specify the executable name when compiling with gcc. - Positional arguments: This would be the final
hello.c. It does not follow any flag like-gor-o.
Let’s revisit the command line arguments that our select program supports:
Usage: ./select -c NCOLS -n NRECS COL1 COL2 COL3 ...
Here, we have:
-c NCOLSis a flag + value combination option,-n NRECSis also a flag + value combination option, andCOL1 COL2 COL3 ...are all positional arguments.
The main function: argc and argv
You should be familiar with the Java way of handling command line arguments. As you no doubt recall, the Java main method looks like this:
public static void main(String[] args) {
// ...
}
The String[] args is an array of Strings, each representing one command line option passed in from the terminal. The number of arguments can be deduced by checking the length of the args array (args.length).
In C, however, arrays do not store their lengths. So there is no such thing as args.length in C. This is why we have two separate parameters in the main function:
int argc: the number of command line arguments (lit., “argument count”).char *argv[]: an array of command line arguments stored as strings (lit., “argument values). Recall that in C, strings are of typechar *, soargvis an array ofchar *values.
Parsing Command Line Arguments Using getopt()
If you look at our starter code for PA 2, you will notice the use of the getopt() function in the beginning of main() to proces command line arguments.
Now, let’s learn about using getopt() so you can write your own command line arguments handling code for this PA.
Suppose we wish to write a program named prog that can be run like this:
./prog -a -b NUM ARG1 ARG2 ARG3 ...
Where -a is a flag-only option, -b NUM is a flag + value option, specifically, it accepts a number as an argument (so we will need to parse the string into a number), and everything after these (ARG1 ARG2 ARG3 ...) are positional arguments.
We can write the following code to process these arguments:
int a_flag = 0; // 1 if -a flag is set, 0 if not set
int b_value = 0; // stores the value of -b flag
int c;
while ((c = getopt(argc, argv, "ab:")) != -1) {
switch (c) {
case 'a': // -a flag detected
a_flag = 1;
break;
case 'b': // -b flag detected
b_value = atoi(optarg); // convert argument value to int
break;
}
}
// After getopt processes all the flags, the positional arguments
// begin at index `optind`:
for (int i = optind; i < argc; i++) {
printf("argv[%d] = %s\n", i, argv[i]);
}
The getopt function call
We call getopt in a while loop and give it both argc and argv. To specify the flags to scan for, we also pass in a string "ab:".
The "ab:" string tells the getopt function that there are two flags to look out for: -a and -b. In addition, the colon (':') in "b:" indicates that the -b flag should be followed by a value.
In the body of the while loop, we use a switch-case statement to process specific flags.
The switch-case statement is another kind of conditional statement. In the code snipppet above, the switch-case statement is equivalent to
if (c == 'a') { a_flag = 1; } else if (c == 'b') { b_value = atoi(optarg); }When writing a switch-case statement, do not forget the
break;statement at the end of each case!
Using optarg
Note that when handling the -b flag, we use the optarg to obtain the value provided by the user from the command line.
This variable is defined by the library. So you do not need to declare it yourself. (There is no declaration for optarg in our code snippet above.)
The optarg variable is a string (char *). If you want the value to be an integer, you will need to use the atoi() function to convert it. You can see it in action in the example above.
Handling Remaining Positional Arguments Using optind
Once the while loop running getopt finishes, all the flags will have been processed. All that remains to be done is processing the positional arguments, but where do they begin?
For this, the library defines another integer variable int optind, which stores the index of the next argument to be processed. After getopt finishes processing all the flag arguments, optind would store the index of the next arguments, which is our first positional argument.
We then use a for loop to process all the remaining arguments.
String Splitting: The Basics
So how does parsing a CSV file actually work? As you will later see in the starter code, the CSV file is processed one line at a time, and one line is one string. So our task is boiled down to this: how do we break down a string into smaller parts? To understand this, we need to understand how a string is stored in memory.
Let’s take this hypothetical line/row from our CSV file:
Frodo,Hobbit,51\n
(Notice the newline character ('\n') at the end!)
When our program runs, this line will be read from the file into a string. And as you should know by now, a string in C is a null-terminated array of characters.
Let’s draw some ASCII art to represent the array of characters:
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ , │ H │ o │ b │ b │ i │ t │ , │ 5 │ 1 │ \n │ \0 │
You can see that the characters are laid out sequentially in memory, and the entire string is terminated by a null character ('\0'). The variable line is actually a pointer to the very beginning of the string, i.e., the first character.
Our goal is to end up with 3 separate null-terminated strings: "Frodo", "Hobbit", and "51".
To achieve this, we can simply replace the delimiter (',' and '\n') with null characters! If we do that, we get the following:
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ \0 │ H │ o │ b │ b │ i │ t │ \0 │ 5 │ 1 │ \0 │ \0 │
↑ ↑ ↑
str1 str2 str3
Now, we have three null-terminated strings: str1 which is "Frodo", str2 which is "Hobbit", and str3 which is "51".
As you will see in later sections, we want to break down the original string into an array of strings. So in this same example, the original string is:
char *line = "Frodo,Hobbit,51";
We want to end up with an array of strings called values:
char *values[] = {"Frodo", "Hobbit", "51"};
It is critical that you understand the logic here for splitting strings completely. If not, you will not be able to do this assignment. If you need help understanding, please utilize office hours and tutor hours liberally!
Getting Started
Now that we understand our goal, let’s discuss the implementation.
Again, the starter code for this assignment is hosted on GitHub Classroom. Use the following link to accept the GitHub Classroom assignment:
Click here to accept this GitHub Classroom assignment. (Right click to open in new tab.)
Just like last time, clone the repository to your ieng6 server. (Do not clone the repo to your local machine.)
Introduction to Code Modularization and the Makefile
Until now, we have used the gcc command directly to compile a single file. However, as you may know, more complicated programs benefit from modularization. This means the source code is spread across multiple files.
For this PA, the code is spread across two files:
main.c: Takes care of argument parsing and error handling.parse.c: The core CSV parsing logic.
With a modularized C program consisting of multiple C files, we no longer type out the gcc command ourselves when we build the executable. Instead, we rely on a build tool called make, which will follow the build recipes described in the Makefile.
To compile the program, simply type make in the terminal. Since the starter code is incomplete, you may get compiler warnings/errors when you try to compile.
Do not attempt to compile the program by running gcc on either of the .c files! Only use make to compile. (It’s easier anyway.)
Next, let’s look at the starter code, starting with main.c:
Task 1 (main.c): Writing getopt
Your first task is writing the code to handle command line arguments. As a reminder, our select program is run as follows:
Usage: ./select -c NCOLS [-n NRECS] COL1 COL2 COL3 ...
We have written the skeleton of the getopt() call and the while loop, just like we showed you earlier in this document. You task is to write the body of the while loop so that the program correctly parses the command line arguments, and stores the values in the corresponding variables ncols and nrows.
After the while loop, there are some error checking code to make sure the arguments are entered correctly. (E.g. the user never entered the number of columns via the -c flag, or if they did not specify any output columns.)
Task 2 (main.c): Setting outidx
The outidx array is an array of integers. This array keeps track of the columns to be selected. (idx is a commonly used shorthand for “index”.)
int outidx[MAX_COLS];
The important thing to note here is that the array is initialized to be quite large: MAX_COLS is defined to be 100, which means we can specify up to 100 columns to output. (You can assume that this limit will not be exceeded.) However, you don’t have to use all 100 spaces in this array.
For instance, if the program is run like so:
$ ./select -c 10 -n 50 1 3 4 6 7 9 10
(Translation: the CSV input has 10 columns in each row, process the first 50 rows, and select the following columns: 1, 3, 4, 6, 7, 9, and 10.)
In this case, the outidx array will look like {1, 3, 4, 6, 7, 9, 10, ...}. The ... here means that technically the array has more spaces, but they are irrelevant for the program.
In fact, you will see in the starter code a count variable, which keeps track of how many columns are supposed to be selected. So the outidx array is really only meaningful up to index count - 1. In the previous example, count would have a value of 7, and outidx[0] = 1, outidx[1] = 3, outidx[2] = 4, …, outidx[6] = 10.
Task 3 (main.c, parse.c): Processing input
In the last part of the main() function, you will find another while loop. This should look familiar to you, as we also had a similar while loop in the previous PA, which also called fgets() to get output from the terminal.
The string you get from fgets() includes a newline character ('\n')!
This while loop does two things:
- Call the
parse()function to break down one line of input into thevaluesarray, and - Print the selected columns using the
valuesarray andoutidxarray.
If the
parse()function returns -1 while processing line N of the CSV input, you should print the following error message tostderrand exit the program with a failure code:“Invalid syntax at line N”
where N is replaced by the actual line number.
The code looks something like this:
if (parse(/* arguments here */) == -1) { fprintf(stderr, INVALID_SYNTAX_MSG, lineno); return EXIT_FAILURE; }The
INVALID_SYNTAX_MSGis a macro:#define INVALID_SYNTAX_MSG "Invalid syntax at line %d.\n"
Note that here is also where you should handle the -n command line argument, which specifies how many rows of input should be processed.
In the following section, we dive into the most important part of this assignment:
The parse() Function
Go to parse.c and you will see the parse() function:
int parse(char *line, int ncols, char *values[])
The comment above the function documents what each parameter does.
To visualize what the parse() function does to the line of CSV data that it gets, let’s return to the previous example of "Frodo,Hobbit,51\n".
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ , │ H │ o │ b │ b │ i │ t │ , │ 5 │ 1 │ \n │ \0 │
In the example above, the first parameter line is "Frodo,Hobbit,51\n", and the second parameter ncols is 3 because there are three columns in the CSV file.
We do not care what data is in the values array at the beginning, because we are using it to store the results of our own computations.
Your parse() function should be able to handle an input line that ends with a newline character! If not, your code will fail some autograder test cases!
The high-level goal of this function is to take this line of CSV data, split it by the delimiter (i.e., the comma (',')), and store the split strings (i.e. each column value "Frodo", "Hobbit", "51”) into the values array.
When we say “store a string”, what’s really happening is we are storing a pointer to the start of the string, i.e., a char * (character pointer).
In the end, we should have values[0] = "Frodo", values[1] = "Hobbit", and values[2] = "51". However, it is more important to realize what that looks like in memory:
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ \0 │ H │ o │ b │ b │ i │ t │ \0 │ 5 │ 1 │ \0 │ \0 │
↑ ↑ ↑
values[0] values[1] values[2]
As you can see, values[0] stores a pointer to the start of "Frodo", values[1] stores a pointer to the start of "Hobbit", and so on.
No new strings are allocated in this process. In other words, we have not allocated more memory for these three values. We are merely manipulating the original string to split it into three strings (by putting in null characters!)
Finding Delimiters Using Pointers!
To find all the delimiter (commas) in the line and replace them with null characters, we will simply traverse the string from start to finish, and process one character at a time.
You are only allowed to use a pointer to traverse the string! You will not receive any points for this PA if you used index-based traversal.
How do you traverse a string using pointers? Here’s a very simple example:
// Use a pointer to print out a string character by character
char *str = "This PA is hard!!";
char *c = str; // this is the pointer that will "walk through" the string
while (*c != '\0') { // as long as the null terminator hasn't been reached
printf("%c", *c); // print out each character
c++; // "walk" to the next character
}
printf("\n"); // print a newline after everything
Let’s again draw a diagram for this code:
char *str = "This PA is hard!\n";
↓
│ T │ h │ i │ s │ │ P │ A │ │ i │ s │ │ h │ a │ r │ d │ ! │ \0 │
↑
char *c = str;
As you can see, str is a character pointer that points to the start of the string/array of characters. When we assign the walker pointer c the value of str, we are basically saying they both point to the same thing. So, as you can see in the diagram, both pointers str and c point to the same position.
When we want to access the character at the pointer, we simply dereference: *c.
To traverse the string, we need to make the c pointer walk through the string. So we need to advance it one character at a time: c++. (Not the programming language!)
After you increment the c pointer by doing c++, it’ll look like this:
// after c++
char *str = "This PA is hard!\n";
↓
│ T │ h │ i │ s │ │ P │ A │ │ i │ s │ │ h │ a │ r │ d │ ! │ \0 │
↑
c (incremented)
You will implement your parse() function based on this traversal! Of course, your code will be much more complicated than this because it needs to do much more.
What If 1: There are more columns than
ncols?This is perfectly acceptable. You can simply stop processing after
ncolsnumber of columns have been processed and you have obtainedncolssubstrings.What If 2: There are fewer columns than
ncols?This would be an error. Because you are expecting at least
ncolscolumns to be present in the string, but after traversing the entire string, you still have not found enough columns.If that is the case, the
parse()function should return -1.
Pointers are hard to understand. They are also the most powerful feature of the C programming language, and arguably the essence of this class.
Do not hesitate to come to office hours for help.
And as you have probably realized while reading this writeup, diagrams are really helpful! You should draw a lot of diagrams if you have trouble working with pointers. Be patient. It takes practice.
Developing, Testing, and Debugging
Rule #1 is always run your code. You are not going to know anything about your code unless you run it! Once again we reiterate the importance of incremental development and testing.
If you don’t develop and test your program incrementally, bugs will pile up and bury you underneath!
To help you test your code, we have provided some example csv files in the starter repository. Just like in PA 2, you can feed these files into your program using redirection.
We have also provided you with a reference implementation. Please use this to check your own implementation!
More instructions will become available on testing.
Testing Your Own Code
In principle, we should always make sure the general use case for a program is working before moving on to testing edge cases/corner cases.
Testing the correctness of your code is about asking the right questions. Here are some questions you should ask yourself when you are testing your code:
- Do I understand how the program is run?
- Does my program correctly parse the command line arguments?
- Are the corresponding variables set correctly?
- Are all the selected columns stored correctly in the
outidxarray?
- Does the program read each line correctly?
- Is the program able to correctly identify delimiters?
- Is the program splitting up substrings (columns) correctly?
- Is the
parse()function filling in thevaluesarray correctly? - Can my program handle duplicate selected columns?
This is in no way an exhaustive list of things you should think about as you test your program. It is your job to test your program extensively before submitting to gradescope.
Do not use Gradescope as a way to verify your implementation. You should be reasonably confident in your own code (having tested it a lot) before even making the first submission!
Committing Using Git
As we have explained in discussion and in lab, git is a great way to do version control. And it goes naturally with incremental development.
Every time you finish a small functionality, you should commit your code.
To commit your code, you need to be in your code repository directory, and run the following command:
$ git commit -am 'commit message here.'
The -am flag is a combination of -a (add all files) and -m (commit message).
For the commit message, you should write something that is at least minimally meaningful. Here are some decent examples:
- Implemented key1 validation and error message.
- Fixed minor bug with key1 validation.
- Implemented loop to covert all input to lower case letters.
- Implemented basic version of encrypt.
Here are some not so great examples:
- .
- asdfasd
- why won’t you work
- aaaarrrgggghhh
- stupid bug
- ******* ****
- wrote some ****
All of these are commit messages that I have written throughout my short career, but you should aim to write yours more like the first list. At least when you are in a decent mood.
Pushing to GitHub
Every once in a while, maybe even after each commit, it would be a good idea to push your progress to GitHub in case the CSE building burns down and the ieng6 servers are destroyed.
To do that, navigate to your code repository directory, and run
$ git push
Simple as that.
Submission
Submitting from GitHub
First, make sure that the most up-to-date version of your code has been pushed to GitHub. If you want to double check, go to the GitHub web page and find your repository to make sure the code changes you made on ieng6 are actually all there.
Having done that, go to Gradescope and find the submission page for this PA. When you submit the assignment, you will see the option to submit from a specific GitHub repository. You will need to link your GitHub account first, after which you should be able to find your pa2 repository. Select it, and choose the main branch.
We have not discussed branching in git yet, so unless you know what you are doing, you should only see main for branch options.
Gradescope Autograder
You score will always be 0 before the deadline! Please see the results for specifc test cases to understand how your code performed.
After you submit your PA on gradescope, you’ll notice there are a set of public test cases and a (significantly larger) set of hidden test cases. As the names would imply, you will only have access to the public test cases, though you will be graded on the result of ALL the test cases.
Note that when we test your parse.c implementation, we have our own simple test program (instead of using the more complicated select program) called parse.