Several sharp eyes noticed some flaws in the OCaml code in my last post.
The main problem was pointed out by Kim Nguyễn in the comments:
your OCaml version is flawed… First in F# you do every computation in integer and only convert as float at the end (it seems). If you do the same in OCaml, you will gain some speed, that is:
list_sum should use 0 and +, and distance should use let x = a - b in x*x to compute the square and only call “float” before sqrt, as you do in F# (and therefore also call float at the end on num_correct since list_sum now returns an integer. Also the computation of num_correct should use now 1 and 0 instead of 1.0 and 0.0).
I made the suggested changes to distance and list_sum:
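In sketch form, the integer-math versions look roughly like this (my reconstruction from Kim's description, not the exact gist — the key point is accumulating with int 0 and +, and calling float_of_int only right before sqrt):

```ocaml
(* Sum an int list with integer addition instead of +. on floats. *)
let list_sum = List.fold_left (+) 0

(* Accumulate squared differences as ints; convert to float only for sqrt. *)
let distance xs ys =
  sqrt @@ float_of_int
    (List.fold_left2 (fun acc a b -> let x = a - b in acc + x * x) 0 xs ys)
```

With list_sum now returning an int, num_correct also accumulates 1 and 0 instead of 1.0 and 0.0, and gets converted to float only for the final percentage.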
Those simple changes yielded a 2X speedup, which puts the OCaml implementation on par with the non-parallelized F# version:
real 0m23.874s
user 0m23.623s
sys 0m0.079s
Then camlspotter pointed out something about float array that I’d forgotten:
In addition, comparing programs using Arrays and Lists is VERY unfair. In this kind of programs with loops you cannot ignore the cost of list construction and deconstruction. In addition, OCaml’s float array has another benefits: its contents are unboxed, while float list does not.
camlspotter was kind enough to submit an Arrayized version of the OCaml code:
```ocaml
(* OCaml version submitted by @camlspotter
   compile with: ocamlopt str.cmxa -o classifyDigitsArray classifyDigitsArray.ml *)

(*
// This F# dojo is directly inspired by the
// Digit Recognizer competition from Kaggle.com:
// http://www.kaggle.com/c/digit-recognizer
// The datasets below are simply shorter versions of
// the training dataset from Kaggle.
// The goal of the dojo will be to
// create a classifier that uses training data
// to recognize hand-written digits, and
// evaluate the quality of our classifier
// by looking at predictions on the validation data.
*)

let read_lines name : string list =
  let ic = open_in name in
  let try_read () =
    try Some (input_line ic) with End_of_file -> None in
  let rec loop acc = match try_read () with
    | Some s -> loop (s :: acc)
    | None -> close_in ic; List.rev acc in
  loop []

(*
// Two data files are included in the same place you
// found this script:
// trainingsample.csv, a file that contains 5,000 examples, and
// validationsample.csv, a file that contains 500 examples.
// The first file will be used to train your model, and the
// second one to validate the quality of the model.

// 1. GETTING SOME DATA
// First let's read the contents of "trainingsample.csv"
*)

type labelPixels = { label: int; pixels: int array }

let slurp_file file =
  List.tl (read_lines file)
  |> List.map (fun line -> Str.split (Str.regexp ",") line)
  |> List.map (fun numline -> List.map (fun (x: string) -> int_of_string x) numline)
  |> List.map (fun line ->
       { label = List.hd line; pixels = Array.of_list @@ List.tl line })
  |> Array.of_list

let trainingset = slurp_file "./trainingsample.csv"

(*
// 6. COMPUTING DISTANCES
// We need to compute the distance between images
// Math reminder: the euclidean distance is
// distance [ x1; y1; z1 ] [ x2; y2; z2 ] =
//   sqrt((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2) + (z1-z2)*(z1-z2))
*)

let array_fold_left2 f acc a1 a2 =
  let open Array in
  let len = length a1 in
  let rec iter acc i =
    if i = len then acc
    else
      let v1 = unsafe_get a1 i in
      let v2 = unsafe_get a2 i in
      iter (f acc v1 v2) (i + 1)
  in
  iter acc 0

let distance p1 p2 =
  sqrt @@ float_of_int
  @@ array_fold_left2 (fun acc a b -> let d = a - b in acc + d * d) 0 p1 p2

(*
// 7. WRITING THE CLASSIFIER FUNCTION
// We are now ready to write a classifier function!
// The classifier should take a set of pixels
// (an array of ints) as an input, search for the
// closest example in our sample, and predict
// the value of that closest element.
*)

let classify (pixels: int array) =
  fst (Array.fold_left
         (fun ((min_label, min_dist) as min) (x: labelPixels) ->
            let dist = distance pixels x.pixels in
            if dist < min_dist then (x.label, dist) else min)
         (max_int, max_float) (* a tiny hack *)
         trainingset)

(*
// 8. EVALUATING THE MODEL AGAINST VALIDATION DATA
// Now that we have a classifier, we need to check
// how good it is.
// This is where the 2nd file, validationsample.csv,
// comes in handy.
// For each Example in the 2nd file,
// we know what the true Label is, so we can compare
// that value with what the classifier says.
// You could now check for each 500 example in that file
// whether your classifier returns the correct answer,
// and compute the % correctly predicted.
*)

let validationsample = slurp_file "./validationsample.csv"

let num_correct =
  Array.fold_left
    (fun sum p -> sum + if classify p.pixels = p.label then 1 else 0)
    0 validationsample

let _ =
  Printf.printf "Percentage correct:%f\n"
  @@ float_of_int num_correct /. float_of_int (Array.length validationsample) *. 100.0
```
Now the non-parallelized OCaml version handily beats the parallelized F# version (by about 5 seconds):
real 0m11.686s
user 0m11.529s
sys 0m0.092s
…though the code is a good bit longer than the F# version, which weighed in at 58 lines.
Observations:
Be very careful about where you’re doing integer vs float math ops; stay with ints as much as possible (so long as you’re not sacrificing accuracy).
float array is preferable to float list because the floats are unboxed in the former.
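To make the unboxing point concrete, a rough micro-benchmark could compare summing the two representations (a sketch only: Sys.time gives coarse CPU timings, and absolute numbers will vary by machine):

```ocaml
(* Sketch: sum one million floats stored as a float array (unboxed,
   contiguous) vs. a float list (each cons cell points to a boxed float,
   so the fold chases pointers and allocates more GC work up front). *)
let () =
  let n = 1_000_000 in
  let arr = Array.make n 1.0 in
  let lst = Array.to_list arr in
  let time label f =
    let t0 = Sys.time () in
    let s = f () in
    Printf.printf "%s: sum = %f (%.3fs)\n" label s (Sys.time () -. t0)
  in
  time "float array" (fun () -> Array.fold_left (+.) 0.0 arr);
  time "float list " (fun () -> List.fold_left (+.) 0.0 lst)
```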
Don’t be obsessed with making an implementation in one language look like the implementation in the comparison language. I was focusing so much on the similarities of the languages that I missed some of the optimization opportunities. Also, since I wrote the F# version first, I tended to make the OCaml version look like the F# version (using minBy, for example, and then implementing my own minBy in OCaml instead of using camlspotter’s Array.fold_left approach).
The parallelization point remains: it’s much easier in F#. How fast would this be if OCaml’s Parmap used some kind of shared-memory implementation? The F# example was about 1.4X faster with Parallel, so a similar mechanism would get the OCaml version down to roughly 8.5 seconds.
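For reference, the scoring loop can be parallelized today with Parmap's process-based model. This is a sketch, not tested here: it assumes the classify, labelPixels, and validationsample definitions from the array version above, plus the parmap opam package.

```ocaml
(* Sketch: score the validation sample across 4 worker processes.
   Parmap forks workers, so trainingset is copied into each process
   rather than shared -- exactly the overhead a shared-memory
   implementation would avoid.
   Assumed build: ocamlfind ocamlopt -package str,parmap -linkpkg ... *)
let num_correct =
  Parmap.parmap ~ncores:4
    (fun p -> if classify p.pixels = p.label then 1 else 0)
    (Parmap.A validationsample)
  |> List.fold_left (+) 0
```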