Determine file separator at run time

I found out recently that some data I had written code to ingest could (rarely) use a different file separator.
The assumption and examples so far had all been tab-separated, but we now needed to support three other file separators without knowing which one a file uses until it arrives.
So I wrote a small snippet that works out the current file's separator, so the calling code can use it accordingly.


object DataSeparatorUtils {
  private val tabRegex = "\\t"
  private val commaRegex = "\\,"
  private val semiColonRegex = "\\;"
  private val pipeRegex = "\\|"
  private val dataSeparatorRegexOptions = Set(tabRegex, commaRegex, semiColonRegex, pipeRegex)
  private val regexToSeparatorMapping = Map(
    tabRegex -> '\t', commaRegex -> ',', semiColonRegex -> ';', pipeRegex -> '|'
  )

  private case class SeparatorInfo(regex: String, occurrences: Int)

  def determineSeparator(line: String): Char = {
    // Drop quoted content so separators inside quotes are not counted.
    val lineWithoutQuotedContent = line.replaceAll("\\\".*?\\\"|\\'.*?\\'|`.*?`", "")

    val chosenSeparatorInfo: SeparatorInfo = dataSeparatorRegexOptions
      .map(regex => {
        // A limit of -1 keeps trailing empty fields, so separators = fields - 1.
        val occurrences: Int = lineWithoutQuotedContent
          .split(regex, -1)
          .length - 1

        SeparatorInfo(regex, occurrences)
      }).maxBy(_.occurrences)

    // Without this check, a line with no separators would return an arbitrary candidate.
    if (chosenSeparatorInfo.occurrences == 0)
      throw new IllegalArgumentException("Content contains no separator. Unable to parse.")

    val chosenSeparator: Option[Char] = regexToSeparatorMapping.get(chosenSeparatorInfo.regex)

    chosenSeparator.getOrElse(throw new RuntimeException("Failed to determine data separator. Unable to parse."))
  }
}
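
For the calling side, here is a minimal Java sketch of how the detected separator might be used to split a line. The class and method names (SeparatorSplitSketch, splitFields) are hypothetical and not part of the snippet above; the separator is assumed to be the Char returned by determineSeparator. It splits naively, so a separator inside a quoted field would still be treated as a field boundary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SeparatorSplitSketch {
    private SeparatorSplitSketch() {}

    // Splits a line on the given separator character (hypothetical helper).
    static List<String> splitFields(final String line, final char separator) {
        // Pattern.quote makes regex metacharacters such as '|' match literally.
        final String literalSeparator = Pattern.quote(String.valueOf(separator));

        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(literalSeparator, -1));
    }
}
```

Quoting the separator is what lets the same call work for the pipe character, which would otherwise be read as regex alternation.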

Convert dates of any format to unix timestamp

I wrote a simple date utils class that will convert dates in any format that you specify in the supportedDateFormats set into a unix timestamp. A lot of people prefer to store dates in this way and have to work with more than one format, so I thought this might be useful.


package com.grimey;

import java.time.*;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.*;
import java.util.stream.Stream;

public final class DateUtils {
    private static final Set<String> supportedDateFormats = new HashSet<>(Arrays.asList(
        "dd-MMM-yyyy HH:mm:ss",
        "yyyy-MM-dd HH:mm:ss"
    ));

    private DateUtils() {}

    // Returns the timestamp from the first format that parses, or 0 if none does.
    public static long dateTimeToUnixTimestamp(final String dateTime) {
        return supportedDateFormats
            .stream()
            .flatMap(format -> convert(dateTime, format)
                .map(Stream::of)
                .orElseGet(Stream::empty))
            .findFirst()
            .orElse(0L);
    }

    private static Optional<Long> convert(final String dateTimeString, final String dateFormatString) {
        final DateTimeFormatter dateTimeFormatter = DateTimeFormatter.ofPattern(dateFormatString);

        try {
            // The input carries no zone information, so it is interpreted as UTC.
            final LocalDateTime localDateTime = LocalDateTime.parse(dateTimeString, dateTimeFormatter);
            final ZonedDateTime zonedDateTime = ZonedDateTime.of(localDateTime, ZoneId.of("UTC"));
            final long timestamp = zonedDateTime.toEpochSecond();

            return Optional.of(timestamp);
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
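
One thing to be aware of is that a return value of 0 is also the valid timestamp for 1970-01-01 00:00:00. Below is a minimal sketch of a variant that returns an OptionalLong, so an unparseable input can be told apart from the epoch. The class name DateParsingSketch and the use of Locale.ENGLISH (so that month abbreviations like "Jan" parse regardless of the machine's default locale) are assumptions for illustration, not part of the class above.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.OptionalLong;

final class DateParsingSketch {
    // Same two patterns as above; Locale.ENGLISH is an assumption of this sketch.
    private static final List<DateTimeFormatter> FORMATTERS = Arrays.asList(
        DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH)
    );

    private DateParsingSketch() {}

    // Tries each format in order; empty means no supported format matched.
    static OptionalLong toUnixTimestamp(final String dateTime) {
        for (final DateTimeFormatter formatter : FORMATTERS) {
            try {
                final LocalDateTime parsed = LocalDateTime.parse(dateTime, formatter);
                // The input has no zone information, so it is treated as UTC.
                return OptionalLong.of(parsed.toEpochSecond(ZoneOffset.UTC));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next one.
            }
        }
        return OptionalLong.empty();
    }
}
```

With this shape, DateParsingSketch.toUnixTimestamp("2020-01-01 00:00:00") yields 1577836800, while an unparseable string yields an empty OptionalLong rather than 0.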

Country name to ISO code mapper

I was looking for a library that allowed me to pass in a country name and get the ISO code for it.
Unfortunately I couldn’t find one, so I wrote a tiny bit of code to achieve this without having to create a huge hard-coded map, as most of the threads on Stack Overflow suggest.

Here’s the code:

package grimey.com;

import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class CountryService {
    private final Map<String, String> countryToISOMap;

    public CountryService() {
        // Build the map once from the JDK's own list of ISO 3166 country codes.
        countryToISOMap = Arrays
            .stream(Locale.getISOCountries())
            .collect(Collectors.toMap(isoCode -> {
                final Locale locale = new Locale("", isoCode);

                // The display name is in the JVM's default locale.
                return locale.getDisplayCountry();
            }, isoCode -> isoCode));
    }

    // Returns the ISO code for an exact country name match, or "" if unknown.
    public String countryToISO(final String country) {
        return countryToISOMap.getOrDefault(country, "");
    }
}
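
Because getDisplayCountry() uses the JVM's default locale and the lookup is an exact match, the result depends on where the code runs and on the caller's capitalisation. Here is a minimal sketch of a variant that pins the names to English and ignores case and surrounding whitespace. The class name CountryLookupSketch, the normalise helper, and the use of Locale.Builder instead of the Locale constructor are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class CountryLookupSketch {
    private final Map<String, String> isoCodeByName;

    CountryLookupSketch() {
        isoCodeByName = Arrays
            .stream(Locale.getISOCountries())
            .collect(Collectors.toMap(
                // Key: normalised English display name of the country.
                isoCode -> normalise(new Locale.Builder().setRegion(isoCode).build()
                    .getDisplayCountry(Locale.ENGLISH)),
                isoCode -> isoCode,
                // Keep the first code if two codes ever share a display name.
                (first, second) -> first));
    }

    // Returns the ISO code for the given English country name, or "" if unknown.
    String countryToISO(final String country) {
        return isoCodeByName.getOrDefault(normalise(country), "");
    }

    private static String normalise(final String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

For example, both "United Kingdom" and " united kingdom " would map to "GB" here, whatever the default locale of the machine is.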

Using completable futures in Java

This week I needed to download a lot of files from a remote server using multi-threading.
I always try to avoid multi-threading and treat it as an absolute last resort, as it makes code significantly more complicated and brings a class of really hard-to-debug bugs with it (race conditions, thread starvation, deadlocks etc).
Unfortunately, as the bottleneck was the network, I couldn’t speed up the solution by optimising the code. So after looking for an alternative, I gave in and looked up how to do multi-threading in Java.

After some research, I decided that out of all the various multi-threading APIs, the CompletableFuture API was the one to go with. This is because the code that processes the downloaded files should not execute until all files are present, and the CompletableFuture API handles this case the best.

The first thing I needed to do was find out how many files were in the remote directory. This was achieved by using an SSH library called JSch (http://www.jcraft.com/jsch/).

Once I had run an ls command, I read the returned input stream into a hash set of strings and then chunked the hash set by the number of threads I wanted to use for downloading. Each chunk of file names is fed to its own thread, which goes off and retrieves those files using an SCP command.
When all the threads have completed their downloads, the join function collects the results (each one is a SshDownloadResult, a class that encapsulates the successfully and unsuccessfully downloaded files).

Here is the code:

final ExecutorService executorService = Executors.newFixedThreadPool(threadsCount);

// One download task per thread, each handling its own chunk of file names.
final List<Callable<SshDownloadResult>> downloadTasks = IntStream
    .range(0, threadsCount)
    .boxed()
    .map(threadIndex -> (Callable<SshDownloadResult>) () -> {
        final HashSet<String> filesNamesForCurrentThread =
            chunkedFileNamesToRetrieve.get(threadIndex);
        return download(connectionSettings, absoluteDownloadFromDirectory,
            downloadToDirectory, filesNamesForCurrentThread);
    }).collect(Collectors.toList());

// Start every task on the pool before joining any of them.
final List<CompletableFuture<SshDownloadResult>> futures = downloadTasks
    .stream()
    .map(task -> CompletableFuture.supplyAsync(() -> {
        try { return task.call(); }
        catch (Exception e) {
            log.error("Download thread " + task.toString() + " threw an exception. Message = "
                + e.getMessage(), "runUsingSsh");
            return null;    // Todo: Handle this better.
        }
    }, executorService))
    .collect(Collectors.toList());

// join() blocks until each future completes, so this waits for all downloads.
final List<SshDownloadResult> downloadResult = futures
    .stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
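
The chunking step described earlier is not part of the snippet, so here is a minimal, self-contained sketch of one way it could look, together with the same supplyAsync and join pattern. Everything in it is hypothetical: the class ChunkedDownloadSketch, the round-robin chunk helper, and fakeDownload, which stands in for the real SCP download and just reports how many names its chunk held.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class ChunkedDownloadSketch {
    private ChunkedDownloadSketch() {}

    // Distributes the file names round-robin over chunkCount sets.
    static List<Set<String>> chunk(final Set<String> fileNames, final int chunkCount) {
        final List<Set<String>> chunks = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            chunks.add(new HashSet<>());
        }
        int index = 0;
        for (final String fileName : fileNames) {
            chunks.get(index % chunkCount).add(fileName);
            index++;
        }
        return chunks;
    }

    // Stand-in for the real download; returns the number of names in the chunk.
    static int fakeDownload(final Set<String> chunk) {
        return chunk.size();
    }

    static List<Integer> downloadAll(final Set<String> fileNames, final int threadsCount) {
        final ExecutorService executorService = Executors.newFixedThreadPool(threadsCount);
        try {
            // Submit every chunk first so the tasks run concurrently.
            final List<CompletableFuture<Integer>> futures = chunk(fileNames, threadsCount)
                .stream()
                .map(c -> CompletableFuture.supplyAsync(() -> fakeDownload(c), executorService))
                .collect(Collectors.toList());

            // Only then block on each future in turn.
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            // Let the pool's threads exit once the work is done.
            executorService.shutdown();
        }
    }
}
```

Collecting the futures into a list before calling join matters: joining inside the same stream pipeline that creates them would start and wait for one task at a time.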

Insertion sort in C#

Following the same theme as the last two posts, here is a Scala-style implementation of insertion sort in C#: https://github.com/CSGrimey/MiscSnippets/blob/master/Code/Algorithms/InsertionSorter.cs

The code for this is way simpler than I expected. The bubble sort is supposed to be one of the simplest sorting algorithms, yet the code is a lot more involved than this insertion sort code.

An attempt to merge sort functionally in C#

In a month I’ll be starting a new job which will be focused on big data using Scala and Apache Spark.
In preparation, I’ve been re-familiarising myself with Scala by implementing several of the most common data structures and algorithms in it.

When comparing how C# programmers (C# being the language I currently use at work) and Scala programmers implement the merge sort, I noticed the approaches are usually quite different. This is partly due to C# and Scala implementing their list data types very differently.
So to help my understanding of the merge sort, I decided to try writing it in C# the way people generally implement it in Scala.

This github link shows what I came up with.
https://github.com/CSGrimey/MiscSnippets/blob/master/Code/Algorithms/MergeSort.cs

I think this has helped my understanding of the logic and implementation of the merge sort, so I may continue doing this with other algorithms, as it forces me to really understand what is going on instead of memorising the code.
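
To illustrate the general idea, here is a minimal Java sketch of a merge sort written as a pure function: it never modifies its input and returns a new, unmodifiable list. This is not the linked C# code, and the class name FunctionalMergeSortSketch is hypothetical. Scala versions typically merge recursively with pattern matching on the list heads; the merge step below uses a loop instead, since deep recursion on long lists is risky on the JVM without tail-call optimisation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class FunctionalMergeSortSketch {
    private FunctionalMergeSortSketch() {}

    // Returns a sorted copy; the input list is never modified.
    static List<Integer> sort(final List<Integer> items) {
        if (items.size() <= 1) {
            return items;
        }
        final int middle = items.size() / 2;
        final List<Integer> left = sort(items.subList(0, middle));
        final List<Integer> right = sort(items.subList(middle, items.size()));
        return merge(left, right);
    }

    // Merges two already sorted lists into one new sorted list.
    static List<Integer> merge(final List<Integer> left, final List<Integer> right) {
        final List<Integer> merged = new ArrayList<>(left.size() + right.size());
        int l = 0;
        int r = 0;
        while (l < left.size() && r < right.size()) {
            // Taking from the left on ties keeps the sort stable.
            if (left.get(l) <= right.get(r)) {
                merged.add(left.get(l++));
            } else {
                merged.add(right.get(r++));
            }
        }
        // At most one of these sublists is non-empty.
        merged.addAll(left.subList(l, left.size()));
        merged.addAll(right.subList(r, right.size()));
        return Collections.unmodifiableList(merged);
    }
}
```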

Join two tables based on value only present in one

I needed to join two tables where one is a list of orders, and the other is a list of items in the orders.
These items can only be for orders created by a specific customer.

The orders table has a foreign key via the customer ID, but the items do not.

You could get all the items, load them into memory and then filter based on the order’s customer ID value, but doing this uses up a lot of memory and is very slow. I needed to do this as lazily as possible.

Fortunately, in LINQ you can join on anonymous objects built from table values and from other values that are not in the table. And as it’s all done via IQueryables, the execution is deferred as late as possible.

int customerID = 12345;

return from item in context.items
       join order in context.Orders
       on new { orderID = item.OrderID, customerID = customerID } equals new { orderID = order.ID, customerID = order.CustomerID }
       select item;

So even though there is no customer ID column in the items table, the anonymous object I created has one that is set to the same value as the customerID I’m looking for. So as long as the customer ID value in an order row matches that, I will get the items I want.

Deep property filtering inside ng-repeat

Something that I’ll likely need to remember how to do in future is filtering a list by an element’s property. This can be taken further by filtering by that property’s property and so on.

 <div ng-repeat="prescription in orderData.prescriptions | filter: {patient: { id: selectedPatient.patient.id}}">

It’s just a matter of creating an inline object that mirrors the path to the relevant property. Much simpler than expected.